The present invention relates to methods of operating reconfigurable
arrays of data processing elements.

When using such arrays, it is desired to optimise the way the
array is coupled to other units, e.g. to a processor if used
as a coprocessor, and/or to optimise the way in which the array
is configured.

The present invention aims at providing improvements over the
prior art.

It is to be noted that the disclosure of the present invention
comprises two major parts in its description that both refer
to ways of allowing for an optimum use of the array and
hence are closely related to each other.

It is also to be noted that the shorter of the two major parts
comprises a plurality of figures that the text relates to,
however without always giving an exact, precise and correct
reference. Yet any deviations from correct referencing will be
obvious to the average skilled person.

The study is concerned with three objectives:

1. Proposal of a hardware framework, which enables an efficient integration of the PACT XPP
core into a standard RISC processor architecture.

2. Proposal of a compiler for the coupled RISC+XPP hardware. This compiler decides
automatically which part of a source code is executed on the RISC processor and which part is
executed on the PACT XPP.

3. Presentation of a number of case studies demonstrating which results may be achieved by
using the proposed C compiler in cooperation with the proposed hardware framework.

The proposed hardware framework accelerates the XPP core in two respects. First, data throughput is
increased by raising the XPP's internal operating frequency into the range of the RISC's frequency.
This, however, means that the XPP runs into the same pitfall as all high frequency processors: memory
accesses become very slow compared to processor internal computations. This is why the use of a
cache is proposed. It eases the memory access problem for a large range of algorithms, which are well
suited for an execution on the XPP. The cache, as the second throughput increasing feature, requires a
controller. Hence a programmable cache controller is introduced, which manages the cache contents
and feeds the XPP core. It decouples the XPP core computations from the data transfer so that, for
instance, data preload to a specific cache sector takes place while the XPP is operating on data located
in a different cache sector.

Another problem emerging with a coupled RISC+XPP hardware is concerned with the RISC's
multitasking concept. It becomes necessary to interrupt computations on the XPP in order to perform a
task switch. Multitasking is supported by the proposed compiler, as well as by the proposed hardware.
First, each XPP configuration is considered as an uninterruptible entity. This means that the compiler,
which generates the configurations, takes care that the execution time of any configuration does not
exceed a predefined time slice. Second, the cache controller is concerned with the saving and restoring
of the XPP's state after an interrupt. The proposed cache concept minimizes the memory traffic for
interrupt handling and frequently even allows avoiding memory accesses altogether.

Finally, the proposed cache concept is based on a simple IRAM cell structure allowing for an easy 
scalability of the hardware - extending the XPP cache size, for instance, requires not much more than 
the duplication of IRAM cells. 

The study proposes a compiler for a RISC+XPP system. The objective of the compiler is that
real-world applications, which are written in the C language, can be compiled for a RISC+XPP system.
The compiler removes the necessity of developing NML code for the XPP by hand. It is possible,
instead, to implement algorithms in the C language or to directly use existing C applications without
much adaptation to the XPP system. The proposed compiler includes three major components to
perform the compilation process for the XPP:

1. partitioning of the C source code into RISC and XPP parts,

2. transformations to optimize the code for the XPP and

3. generating NML code.

Finally the generated NML code is placed and routed for the XPP.

The partitioning component of the compiler decides which parts of an application code can be
executed on the XPP and which parts are executed on the RISC. Typical candidates for becoming XPP
code are loops with a large number of iterations whose loop bodies are dominated by arithmetic
operations. The remaining source code - including the data transfer code - is compiled for the RISC.

The proposed compiler transforms the XPP code such that it is optimized for NML code generation. 
The transformations included in the compiler comprise a large number of loop transformations as well 
as general code transformations. Together with data and code analysis the compiler restructures the 
code so that it fits into the XPP array and that the final performance exceeds the pure RISC
performance. Finally the compiler generates NML code from the transformed program. The whole 
compilation process is controlled by an optimization driver which selects the optimal order of 
transformations based on the source code. 

The case studies form a major part of the study. The selection of the examples is conducted by the
guiding principle that each example stands for a set of typical real-world applications. For each 
example the study demonstrates the work of the proposed compiler. First the code is partitioned. The 
code transformations, which are done by the compiler, are shown and explained. Some examples 
require minor source code transformations which must be performed by hand. The study argues that 
these transformations are either too expensive, or too specific to make sense to be included in the 
proposed compiler. Dataflow graphs of the transformed codes are constructed for each example, which 
are used by the compiler to generate the NML code. In addition the XPP resource usages are shown.

The case studies demonstrate that a compiler containing the proposed transformations can generate 
efficient code from numerical applications for the XPP. This is possible because the compiler relies on 
the features of the suggested hardware, like the cache controller.

2 Hardware



2.1 Design Parameter Changes 

Since the XPP core shall be integrated as a functional unit into a standard RISC core, some system 
parameters have to be reconsidered:

2.1.1 Pipelining / Concurrency / Synchronicity

RISC instructions of totally different type (Ld/St, ALU, Mul/Div/MAC, FPALU, FPMul, ...) are
executed in separate specialized functional units to increase the fraction of silicon that is busy on
average. Such functional unit separation has led to superscalar RISC designs that exploit higher levels
of parallelism.

Each functional unit of a RISC core is highly pipelined to improve throughput. Pipelining overlaps the 
execution of several instructions by splitting them into unrelated phases, which are executed in 
different stages of the pipeline. Thus different stages of consecutive instructions can be executed in 
parallel with each stage taking much less time to execute. This allows higher core frequencies. 

Since the pipelines of all functional units are approximately subdivided into sub-operations of the
same size (execution time), these functional units / pipelines execute in a highly synchronous manner,
with complex floating point pipelines being the exception.

Since the XPP core uses data flow computation, it is pipelined by design. However, a single 
configuration usually implements a loop of the application, so the configuration remains active for 
many cycles, unlike the instructions in every other functional unit, which typically execute for one or 
two cycles at most. Therefore it is still worthwhile to consider the separation of several phases (e.g.
Ld / Ex / Store) of an XPP configuration (= XPP instruction) into several functional units to improve
concurrency via pipelining on this coarser scale. This also improves throughput and response time in
conjunction with multitasking operations and implementations of simultaneous multithreading (SMT).

The multi-cycle execution time also forbids a strongly synchronous execution scheme and rather leads
to an asynchronous scheme, as used e.g. for floating point square root units. This in turn necessitates the
existence of explicit synchronization instructions.

2.1.2 Core Frequency / Memory Hierarchy

As a functional unit, the XPP's operating frequency will either be half of the core frequency or equal
to the core frequency of the RISC. The core frequency of almost every RISC core currently on the
market exceeds its memory bus frequency by a large factor. Therefore caches are employed,
forming what is commonly called the memory hierarchy: Each layer of cache is larger but slower than
its predecessors.

This memory hierarchy does not help to speed up computations which shuffle large amounts of data
with little or no data reuse. These computations are called "bounded by memory bandwidth". However,
other types of computations with more data locality (another name for data reuse) gain performance as
long as they fit into one of the upper layers of the memory hierarchy. This is the class of applications
that gains the highest speedups when a memory hierarchy is introduced.

Classical vectorization can be used to transform memory-bounded algorithms with a data set too big
to fit into the upper layers of the memory hierarchy. Rewriting the code to reuse smaller data sets
sooner exposes memory reuse on a smaller scale. As the new data set size is chosen to fit into the
caches of the memory hierarchy, the algorithm is not memory bounded any more, yielding significant
speed-ups.
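
As an illustration of such a rewrite, the following is a minimal sketch of loop tiling (blocking) applied to a matrix multiplication, written in plain C for a generic cached processor. The function name matmul_tiled and the tile size TILE are assumptions made for this example only; they are not part of the proposed hardware or compiler. TILE would be chosen so that the three tiles touched by the inner loops fit into one cache layer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge: pick it so that three TILE x TILE blocks fit into the cache. */
#define TILE 16

/* Returns the smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes c += a * b for n x n row-major matrices, one tile at a time,
   so that each tile is reused while it is still held in the cache. */
void matmul_tiled(size_t n, const int *a, const int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < min_size(ii + TILE, n); i++)
                    for (size_t k = kk; k < min_size(kk + TILE, n); k++) {
                        int a_ik = a[i * n + k];
                        for (size_t j = jj; j < min_size(jj + TILE, n); j++)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
}
```

The arithmetic is the same as in the untiled triple loop; only the iteration order changes, which is what moves the working set into the upper layers of the memory hierarchy.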



As the XPP is introduced into a RISC core, the changed environment - higher frequency and the
memory hierarchy - not only necessitates reconsideration of hardware design parameters, but also a
reevaluation of the software environment.



The introduction of a memory hierarchy enhances the set of applications that can be implemented
efficiently. So far the XPP has mostly been used for algorithms that read their data set in a linear
manner, applying some calculations in a pipelined fashion and writing the data back to memory. As
long as all of the computation fits into the XPP array, these algorithms are memory bounded. Typical
applications are filtering and audio signal processing in general.

But there is another set of algorithms that have even higher computational complexity and higher
memory bandwidth requirements. Examples are picture and video processing, where a second and
third dimension of data coherence opens up. This coherence is e.g. exploited by picture and video
compression algorithms that scan pictures in both dimensions to find similarities, even searching
consecutive pictures of a video stream for analogies. Naturally these algorithms have a much higher
algorithmic complexity as well as higher memory requirements. Yet they are data local, either by
design or because they can be transformed to be, thus efficiently exploiting the memory hierarchy and
the higher clock frequencies of processors with memory hierarchies.

Multi Tasking

The introduction into a standard RISC core makes it necessary to understand and support the needs of
a multitasking operating system, as standard RISC processors are usually operated in multitasking
environments. With multitasking, the operating system switches the executed application on a regular
basis, thus simulating concurrent execution of several applications (tasks). To switch tasks the
operating system has to save the state (context, i.e. the contents of all registers, ...) of the running task
and then reload the state of another task. Hence it is necessary to determine what the state of the
processor is, and to keep it as small as possible to allow efficient context switches.

The newest development with RISC processors, Simultaneous MultiThreading (SMT), adds hardware
support for a finer granularity (instruction / functional unit level) switching of tasks, exposing more
than one independent instruction stream to be executed. Thus, whenever one instruction stream stalls
or doesn't utilize all functional units, the other one can jump in. This improves functional unit
utilization for today's processors.

With SMT the task (process) switching is done in hardware, so the processor state has to be
duplicated in hardware. So again it is most efficient to keep the state as small as possible. For the
combination of the PACT XPP and a standard RISC processor, SMT is very beneficial, since the XPP
configurations execute longer than the average RISC instruction. Thus another task can utilize the
other functional units while a configuration is running. On the other hand, not every task will utilize
the XPP, so while one such non-XPP task is running, another one will be able to use the XPP core.



2.2 Communication Between the RISC Core and the
XPP Core

2.2.1 Streaming 

Since streaming can only support (number_of_IO_ports * width_of_IO_port) bits per cycle, it is only
well suited for small XPP arrays with heavily pipelined configurations that feature few inputs and
outputs. As the pipelines take a long time to fill and empty while the running time of a configuration is
limited (as will be described under "context switches"), this type of communication does not scale well
to bigger arrays and array frequencies near the RISC core frequency.
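
The bandwidth bound stated above can be written down directly. The following is a small sketch; the function names and the example figures used with it (4 ports of 32 bits) are assumptions for illustration and are not taken from a specific XPP device.

```c
#include <assert.h>

/* Upper bound on streaming throughput in bits per cycle:
   number_of_IO_ports * width_of_IO_port. */
unsigned stream_bits_per_cycle(unsigned number_of_io_ports, unsigned width_of_io_port)
{
    return number_of_io_ports * width_of_io_port;
}

/* Minimum number of cycles needed to stream `bits` through the ports (rounded up). */
unsigned long stream_cycles(unsigned long bits, unsigned number_of_io_ports,
                            unsigned width_of_io_port)
{
    unsigned long per_cycle = stream_bits_per_cycle(number_of_io_ports, width_of_io_port);
    assert(per_cycle > 0);
    return (bits + per_cycle - 1) / per_cycle;
}
```

With the assumed 4 ports of 32 bits each, at most 128 bits enter or leave the array per cycle regardless of the array size, which is why this scheme does not scale to larger arrays.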

* Streaming from the RISC core 

In this setup, the RISC supplies the array with the streaming data. Since the RISC core has to
execute several instructions to compute addresses and load an item from memory, this setup is
only suitable if the XPP core is reading data with a frequency much lower than the RISC core
frequency.

* Streaming via DMA

In this mode the RISC core only initializes a DMA channel which then supplies the data items
to the streaming port of the XPP core.

2.2.2 Shared Memory (Main Memory) 

In this configuration the XPP array configuration uses a number of PAEs to generate an address that is
used to access main memory through the IO ports. As the number of IO ports is very limited, this
approach suffers from the same limitations as the previous one, although for larger arrays the impact
of using PAEs for address generation is lessening. However, this approach is still useful for loading
values from very sparse arrays.

2.2.3 Shared Memory (IRAM) 

This data access mechanism uses the IRAM array elements to store data for local computations. The 
IRAMs can either be viewed as vector registers or as local copies of main memory. 

There are several ways to fill the IRAMs with data.

The IRAMs can be loaded in advance by separate "load" configurations using streaming.

This corresponds to the usage as vector registers. As explained above, this will limit the
performance of the XPP array, especially as the IRAMs will always be part of the externally
visible state and hence must be saved and restored on context switches.

The IRAMs can be loaded in advance by separate "load" instructions.

This can be viewed as a hard coded load configuration and reduces configuration reloads.
Additionally, the special load instructions may use a wider interface to the memory hierarchy.
This corresponds to the usage as vector registers.

The IRAMs can be loaded by a "burst preload from memory" instruction of the cache
controller.

The best mode however is a combination of the previous solutions with the extension of a
cache:

A preload instruction maps a specific memory area defined by starting address and size to
IRAMx. This triggers a (delayed, low priority) burst load from the memory hierarchy (cache).
After all IRAMs are mapped, the next configuration can be activated. The activation incurs a
wait until all burst loads are completed. However, if the preload instructions are issued long
enough in advance and no interrupt or task switch destroys cache locality, the wait will not
consume any time.

To specify a memory block as output IRAM, a "preload clean" instruction is used, which
avoids loading data from memory.

A synchronization instruction is needed to make sure that the content of a specific memory
area, which is cached in an IRAM, is written back to the memory hierarchy. This can be done
globally (full write back), or selectively by specifying the memory area which will be
accessed.

We propose an XPP integration as an asynchronously pipelined functional unit for the RISC. We
further propose an explicitly preloaded cache for the IRAMs, on top of the memory hierarchy existing
within the RISC. Additionally a decentralized explicitly preloaded configuration cache within the
PAE array is employed to support preloading of configurations and fast switching between
configurations.

Since the IRAM content is an explicitly preloaded memory area, a virtually unlimited number of such
IRAMs can be used. They are identified by their memory address and their size. The IRAM content is
explicitly preloaded by the application. Caching will increase performance by reusing data from the
memory hierarchy. The cached operation also eliminates the need for explicit store instructions: they
are handled implicitly by cache write back operations but can also be forced for synchronization.

The pipeline stages of the XPP functional unit are Load, Execute and Write Back (Store). The store is
executed delayed as a cache write back. The pipeline stages execute in an asynchronous fashion, thus
hiding the variable delays from the cache preloads and the PAE array.

[...] instructions. The PAE array consumes and executes the configurations and the preloaded
IRAMs. Synchronization of the XPP and the RISC is done explicitly by a synchronization instruction.



In the following we define the instruction formats needed for the proposed architecture. We use a C
style prototype definition to specify data types. All instructions, except the XPPSync instruction,
execute asynchronously. The XPPSync instruction can be used to force synchronization.

XPPPreloadConfig (void *ConfigurationStartAddress)

The configuration is added to the preload FIFO to be loaded into the configuration cache within the
PAE array.

Note that speculative preloads are possible, since successive preload commands overwrite the
previous ones.

The parameter is a pointer register of the RISC pointer register file. The size is implicitly contained in 
the configuration. 

XPPPreload (int IRAM, void *StartAddress, int Size)
XPPPreloadClean (int IRAM, void *StartAddress, int Size)

This instruction specifies the contents of the IRAM for the next configuration. In fact, the memory
area is added to the preload FIFO to be loaded into the specified IRAM.

The first parameter is the IRAM number. This is an immediate (constant) value.

The second parameter is a pointer to the starting address. This parameter is provided in a pointer
register of the RISC pointer register file.

The third parameter is the size. This is an integer value. It resides in a general-purpose register of the 
RISC's integer register file. 

The first variant actually preloads the data from memory. 

The second variant is for write-only accesses. It skips the loading operation. Thus no cache misses can 
occur for this IRAM. Only the address and size are defined. They are obviously needed for the write 
back operation of the IRAM cache. 

Note that speculative preloads are possible, since successive preload commands to the same IRAM
overwrite each other. Thus only the last preload command is actually effective when the configuration
is executed.

XPPExecute ()

This instruction executes the last preloaded configuration with the last preloaded IRAM contents.
Actually a configuration start command is issued to the FIFO. Then the FIFO is advanced; this means
that further preload commands will specify the next configuration or parameters for the next
configuration. Whenever a configuration finishes, the next one is consumed from the head of the
FIFO, if its start command has already been issued.

XPPSync (void *StartAddress, int Size) 

This instruction forces write back operations for all IRAMs that overlap the given memory area. If
overlapping IRAMs are still in use by a configuration or preloaded to be used, this operation will
block. Giving an address of NULL (zero) and a size of MAX_INT (bigger than the actual memory),
this instruction can also be used to wait until all issued configurations finish.
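
The overlap condition that selects the IRAMs to be written back can be sketched in software as follows. This is an illustrative model only: the record type iram_area and the functions areas_overlap and count_write_backs are hypothetical names introduced here, and the source does not specify how the cache controller implements the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping record for one cached IRAM area. */
typedef struct {
    uintptr_t start;  /* first address covered by the IRAM */
    size_t size;      /* extent, in the same units as the addresses */
    bool dirty;       /* content differs from the memory hierarchy */
} iram_area;

/* True if [a_start, a_start + a_size) and [b_start, b_start + b_size) share an address.
   Written with differences so that a very large size cannot overflow. */
bool areas_overlap(uintptr_t a_start, size_t a_size, uintptr_t b_start, size_t b_size)
{
    if (a_size == 0 || b_size == 0)
        return false;
    if (a_start <= b_start)
        return b_start - a_start < a_size;
    return a_start - b_start < b_size;
}

/* Counts the dirty IRAMs that a sync over (start, size) would have to write back. */
size_t count_write_backs(const iram_area *irams, size_t n, uintptr_t start, size_t size)
{
    size_t count = 0;
    assert(irams != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (irams[i].dirty && areas_overlap(irams[i].start, irams[i].size, start, size))
            count++;
    return count;
}
```

Passing a start of zero together with the largest representable size covers every area, which mirrors the use of NULL and MAX_INT described above as a global synchronization.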
Figure 1: Memory interface

The XPP core shares the memory hierarchy with the RISC core using a special cache controller.



for (int i = 0; i < 1000; i++) {
    XPPPreloadConfig( CONFIG1 );
    XPPPreload( IRAM2, 0x20000, 30 );
    XPPPreload( IRAM0, 0x20400, 200 );
    XPPPreloadClean( IRAM1, 0x20000, 10 );
    /*
       other RISC computations;
       in the meanwhile the burst preloads and
       the previous configuration are running
    */
    XPPExecute( CONFIG1 );
    /*
       other RISC computations,
       maybe burst preloads of
       another configuration and other data
    */
}

Note: in all places where constants are used,
the value should actually come from a register.

Figure 2: IRAM & configuration cache controller data structures and usage example



The FIFOs in the above figure contain the addresses and sizes for already issued IRAM
preloads, exposing them to the XPP cache controller. The FIFOs have to be replicated for every
virtual processor in an SMT environment. Tag is the typical tag for a cache line, containing
address, size and state (empty / clean / dirty / in-use). The additional in-use state signals usage by the
current configuration. The cache controller cannot manipulate these IRAM instances.

The execute configuration command advances all preload FIFOs, copying the old state to the newly
created entry. This way the following preloads replace the previously used [...].
If no preload is issued for an IRAM, the previous information is retained, eliminating identical preload
commands.
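
The FIFO behaviour described here (a later preload to the same IRAM overwrites the earlier one, and an execute command advances the FIFO while copying the old state into the new entry) can be modelled with a few lines of C. This is a sketch under stated assumptions: NUM_IRAMS, FIFO_DEPTH and all type and function names are hypothetical, and the handling of a full FIFO is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IRAMS  4  /* hypothetical number of IRAMs */
#define FIFO_DEPTH 4  /* hypothetical number of configurations in flight */

/* One requested IRAM mapping: memory area given by start address and size. */
typedef struct {
    uintptr_t start;
    size_t size;
} preload_entry;

typedef struct {
    preload_entry entry[FIFO_DEPTH][NUM_IRAMS];
    unsigned write_idx;  /* slot collecting preloads for the next configuration */
} preload_fifo;

/* A preload only overwrites the entry of its IRAM in the current slot,
   so only the last preload before an execute is effective. */
void fifo_preload(preload_fifo *f, unsigned iram, uintptr_t start, size_t size)
{
    assert(f != NULL && iram < NUM_IRAMS);
    f->entry[f->write_idx][iram] = (preload_entry){ start, size };
}

/* An execute advances the write pointer and copies the old state to the new
   slot, so IRAMs without a new preload keep their previous mapping. */
void fifo_execute(preload_fifo *f)
{
    assert(f != NULL);
    unsigned next = (f->write_idx + 1) % FIFO_DEPTH;
    for (unsigned i = 0; i < NUM_IRAMS; i++)
        f->entry[next][i] = f->entry[f->write_idx][i];
    f->write_idx = next;
}
```

In this model a speculative preload costs nothing if it is later replaced, because the replaced entry never reaches an execute command.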




Figure 3: Asynchronous pipeline of the XPP



Each configuration's execute command has to be delayed (stalled) until all necessary preloads are
finished, either explicitly by the use of a synchronization command or implicitly by the cache
controller. Hence the cache controller (XPP Ld/St unit) has to handle the synchronization and execute
commands as well, actually starting the configuration as soon as all data is ready. After the termination
of the configuration, dirty IRAMs are written back to memory as soon as possible, if their content is
not reused in the same IRAM. The XPP PAE array and the XPP cache controller can therefore be seen
as a single unit since they do not have different instruction streams: rather, the cache controller can be
seen as the configuration fetch (CF), operand fetch (OF) (IRAM preload) and write back (WB) stage
of the XPP pipeline, also triggering the execute stage (EX) (PAE array).

Due to the long latencies and their non-predictability (cache misses, variable length configurations),
the stages can be overlapped several configurations wide using the configuration and data preload
FIFO (pipeline) for loose coupling. So if a configuration is executing and the data for the next has
already been preloaded, the data for the next but one configuration is preloaded. These preloads can be
speculative; the amount of speculation is the compiler's trade-off. The length of the preload FIFO can
be several configurations; it is limited by diminishing returns, algorithm properties and the compiler's
ability to schedule preloads early. Due to this loosely coupled operation, the interlocking cannot be
done optimally by software (scheduling), but has to be enforced by hardware (hardware interlocking).
Hence the XPP cache controller and the XPP PAE array can be seen as separate but not totally
independent functional units.

The XPP cache controller has several tasks. These are depicted as states in the above diagram. State
transitions take place along the edges between states, whenever the condition for the edge is true. As
soon as the condition is not true any more, the reverse state transition takes place. The activities for the
states are as follows:

At the lowest priority, the XPP cache controller has to fulfill already issued preload commands, while
writing back dirty IRAMs as soon as possible.

As soon as a configuration finishes, the next configuration can be started. This is a more urgent task
than write backs or future preloads. To be able to do that, all associated yet unsatisfied preloads have
to be finished first. Thus they are preloaded with the high priority inherited from the execute state.

A preload can be blocked by an overlapping in-use or dirty IRAM instance in a different block,
or by the lack of empty IRAM instances in the target IRAM block. The former can be resolved by
waiting for the configuration to finish and / or by a write back. To resolve the latter, the least recently
used clean IRAM can be discarded, thus becoming empty. If no empty or clean IRAM instance exists,
a dirty one has to be written back to the memory hierarchy. It cannot occur that no empty, clean or
dirty IRAM instances exist, since [...] instance in an IRAM block - otherwise [...]
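
The order of preference for freeing an IRAM instance (an empty instance first, then a clean one, and a dirty one only as a last resort) can be sketched as follows. The types, the last_used counter and the function name choose_instance are assumptions introduced for this illustration. In particular, choosing the least recently used instance among the dirty candidates is an assumption of this sketch, as the text only states that a dirty instance has to be written back.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY, IRAM_IN_USE } iram_state;

/* Hypothetical per-instance record; a smaller last_used means used longer ago. */
typedef struct {
    iram_state state;
    unsigned last_used;
} iram_instance;

/* Returns the index of the instance to (re)use for a preload, or -1 if every
   instance is in use. *needs_write_back is set to 1 if the chosen instance is
   dirty and must be written back to the memory hierarchy first. */
int choose_instance(const iram_instance *inst, size_t n, int *needs_write_back)
{
    int clean = -1, dirty = -1;
    assert(inst != NULL && needs_write_back != NULL);
    *needs_write_back = 0;
    for (size_t i = 0; i < n; i++) {
        switch (inst[i].state) {
        case IRAM_EMPTY:
            return (int)i;  /* nothing to discard */
        case IRAM_CLEAN:
            if (clean < 0 || inst[i].last_used < inst[clean].last_used)
                clean = (int)i;
            break;
        case IRAM_DIRTY:
            if (dirty < 0 || inst[i].last_used < inst[dirty].last_used)
                dirty = (int)i;
            break;
        case IRAM_IN_USE:
            break;  /* cannot be manipulated by the cache controller */
        }
    }
    if (clean >= 0)
        return clean;  /* discard the least recently used clean instance */
    if (dirty >= 0) {
        *needs_write_back = 1;
        return dirty;
    }
    return -1;
}
```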
In an SMT environment the load FIFOs have to be replicated for every virtual processor. The pipelines
of the functional units are fed from the common fetch / reorder / issue stage. All functional units
execute in parallel. Different units can execute instructions of different virtual processors.

2.2.4



To further decrease the penalty for unloaded IRAMs, a simple write pointer may be used per IRAM,
which keeps track of the last address already in the IRAM. Thus no stall is required, unless an access
beyond this write pointer is encountered. This is especially useful if all IRAMs have to be reloaded
after a task switch: The delay to the configuration start can be much shorter, especially if the preload
engine of the cache controller chooses the blocking IRAM next whenever several IRAMs need
reloading.
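
The write pointer check can be illustrated by a small software model. The structure and function names are hypothetical, and the write pointer is modelled here as the number of words already loaded rather than as a memory address; this is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical load state of one IRAM during a (re)load. */
typedef struct {
    size_t size;       /* words requested by the preload */
    size_t write_ptr;  /* words already loaded; indices below it are valid */
} iram_load_state;

/* Called when a burst of `words` has arrived from the memory hierarchy. */
void burst_arrived(iram_load_state *s, size_t words)
{
    assert(s != NULL && s->write_ptr <= s->size);
    if (words > s->size - s->write_ptr)
        s->write_ptr = s->size;  /* clamp: the IRAM is completely loaded */
    else
        s->write_ptr += words;
}

/* An access has to stall only if it lies at or beyond the write pointer. */
bool access_must_stall(const iram_load_state *s, size_t word_index)
{
    assert(s != NULL && word_index < s->size);
    return word_index >= s->write_ptr;
}
```

A configuration that reads its IRAMs in linear order can therefore start as soon as the first burst has arrived, instead of waiting for the complete preload.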

Longer FIFOs

The frequency at the bottom of the memory hierarchy (main memory) cannot be raised to the same
extent as the frequency of the CPU core. To increase the concurrency between the RISC core and the
PACT XPP core, the prefetch FIFOs in the above drawing can be extended. Thus the IRAM contents
for several configurations can be preloaded, like the configurations themselves. A simple convention
makes clear which IRAM preloads belong to which configuration: the configuration execute switches
to the next configuration context. This can be accomplished by advancing the FIFO write pointer with
every configuration execute, while leaving it unchanged after every preload. Unassigned IRAM FIFO
entries keep their contents from the previous configuration, so every succeeding configuration will use
the preceding configuration's IRAMx if no different IRAMx was preloaded.

If none of the memory areas to be copied to IRAMs is in any cache, extending the FIFOs does not 
help, as the memory is the bottleneck. So the cache size should be adjusted together with the FIFO 
length. 

A drawback of extending the FIFO length is the increased likelihood that the IRAM content written by
an earlier configuration is reused by a later one in another IRAM. A cache coherence protocol can
clear the situation. Note however that the situation can be resolved more easily: If an overlap between
any new IRAM area and a currently dirty IRAM content of another IRAM bank is detected, the new
IRAM is simply not loaded until the write back of the changed IRAM has finished. Thus the execution
of the new configuration is delayed until the correct data is available.

For a short (single entry) FIFO, an overlap is extremely unlikely, since the compiler will usually leave 
the output IRAM contents of the previous configuration in place for the next configuration to skip the 
preload. The compiler does so using a coalescing algorithm for the IRAMs / vector registers. 



The IRAMs are block-oriented structures, which can be read in any order by the PAE array. However, 
the address generation adds complexity, reducing the number of PAEs available for the actual 
computation. So it is best if the IRAMs are accessed in linear order. The memory hierarchy is block
oriented as well, further encouraging linear access patterns in the code to avoid cache misses.

As the IRAM read ports limit the bandwidth between each IRAM and the PAE array to one word read
per cycle, it can be beneficial to distribute the data over several IRAMs to remove this bottleneck. The
top of the memory hierarchy is the source of the data, so the amount of cache misses never increases
when the access pattern is changed, as long as the data locality is not destroyed.

Many algorithms access memory in linear order by definition to utilize block reading and simple
address calculations. In most other cases and in the cases where loop tiling is needed to increase the
data bandwidth between the IRAMs and the PAE array, the code can be transformed in a way that data
is accessed in optimal order. In many of the remaining cases, the compiler can modify the access
pattern by data layout rearrangements, so that finally the data is accessed in the desired pattern. If none
of these optimizations can be used because of dependencies, or because the data layout is fixed, there
are still two possibilities to improve performance:

Data Duplication: 

Data is duplicated in several IRAMs. This circumvents the IRAM read port bottleneck, allowing 
several data items to be read from the input every cycle. 

Several options are possible with a common drawback: data duplication can only be applied to input 
data: output IRAMs obviously cannot have overlapping address ranges. 

o Using several IRAM preload commands specifying just different target IRAMs:

This way cache misses occur only for the first preload. All other preloads will take place without
cache misses - only the time to transfer the data from the top of the memory hierarchy to the
IRAMs is needed for every additional load. This is only beneficial if the cache misses plus the
additional transfer times do not exceed the execution time for the configuration.

o Using an IRAM preload instruction to load multiple IRAMs concurrently:

As identical data is needed in several IRAMs, they can be loaded concurrently by writing the
same values to all of them. This amounts to finding a clean IRAM instance for every target
IRAM, connecting them all to the bus and writing the data to the bus.

The problem with this instruction is that it requires a bigger immediate field for the destination
(16 bits instead of 4 for the XPP 64). Accordingly this instruction format grows at a higher rate
when the number of IRAMs is increased for bigger XPP arrays.

The interface of this instruction looks like: 

XPPPreloadMultiple (int IRAMS, void *StartAddress, int Size)

This instruction behaves as the XPPPreload / XPPPreloadClean instructions with the exception
of the first parameter:

The first parameter is IRAMS. This is an immediate (constant) value. The value is a bitmap - for
every bit set in the bitmap, the IRAM with that number is a target for the load operation.

There is no "clean" version, since data duplication is applicable for read data only.
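
The bitmap encoding of the IRAMS parameter can be illustrated with a small software model. This is only a sketch under stated assumptions: the 16 IRAMs follow from the 16-bit bitmap quoted for the XPP 64, while the IRAM capacity of 128 words and the helper names iram_mask and model_preload_multiple are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_IRAMS  16   /* 16-bit bitmap, as quoted for the XPP 64 */
#define IRAM_WORDS 128  /* assumed IRAM capacity in words */

/* Software stand-in for the IRAM contents (illustration only). */
static int32_t iram[NUM_IRAMS][IRAM_WORDS];

/* Returns the bitmap that has only the bit for IRAM number n set. */
static uint16_t iram_mask(int n)
{
    assert(n >= 0 && n < NUM_IRAMS);
    return (uint16_t)(1u << n);
}

/* Models the effect of XPPPreloadMultiple: every IRAM whose bit is set
 * in 'irams' receives the same 'size' words starting at 'start'.
 * Returns the number of IRAMs that were loaded. */
static int model_preload_multiple(uint16_t irams, const int32_t *start, int size)
{
    int loaded = 0;
    assert(size >= 0 && size <= IRAM_WORDS);
    for (int n = 0; n < NUM_IRAMS; n++) {
        if (irams & (1u << n)) {
            memcpy(iram[n], start, (size_t)size * sizeof(int32_t));
            loaded++;
        }
    }
    return loaded;
}
```

Combining masks with a bitwise OR, for instance iram_mask(0) | iram_mask(3), selects IRAM 0 and IRAM 3 as targets of a single load.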
Data Reordering:

Data reordering changes the access pattern to the data only. It does not change the amount of memory
that is read. Thus the number of cache misses stays the same.

o Adding additional functionality to the hardware: 

o Adding a vector stride to the preload instruction. 

A stride (displacement between two elements in memory) is used in vector load
operations to load e.g. a column of a matrix into a vector register.

One problem with this instruction is that the number of possible
cache misses per IRAM load rises: in the worst case it can be one cache miss per
loaded value, if the stride is equal to the cache line size and none of the data is in the cache.
But as already stated, the total number of misses stays the same - just the distribution
changes. Still this is an undesirable effect.

The other problem is the complexity of the implementation and a possibly limited
throughput, as the data paths between the layers of the memory hierarchy are
optimized for block transfers. Transferring non-contiguous words will not use wide
busses in an optimal fashion.

The interface of the instruction looks like:

XPPPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)

XPPPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride)

This instruction behaves as the XPPPreload / XPPPreloadClean instructions with the
addition of another parameter:

The fourth parameter is the vector stride. This is an immediate (constant) value. It tells
the cache controller to load only every n-th value to the specified IRAM.
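
The data movement of a strided preload, and the way it spreads accesses over cache lines, can be sketched in software. The following model is an illustration only: the cache line size of 8 words and the function name model_preload_stride are assumptions, and the line count is taken relative to a line-aligned start address.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_WORDS 8  /* assumed cache line size in words */

/* Models the data movement of a strided preload: copies 'size' words,
 * taking every 'stride'-th word starting at 'start'. Returns the number
 * of distinct cache lines touched, counted relative to 'start'
 * (assumes 'start' is line aligned). */
static int model_preload_stride(int32_t *iram, const int32_t *start,
                                int size, int stride)
{
    int lines = 0;
    long last_line = -1;
    assert(size >= 0 && stride >= 1);
    for (int k = 0; k < size; k++) {
        long word = (long)k * stride;
        long line = word / LINE_WORDS;
        if (line != last_line) {   /* offsets ascend, so lines do too */
            lines++;
            last_line = line;
        }
        iram[k] = start[word];
    }
    return lines;
}
```

Loading one column of an 8 x 8 word matrix (stride 8) touches eight lines in this model, whereas loading one row (stride 1) touches a single line, which matches the worst case described above.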

o Reordering the data at run time, introducing temporary copies. 

o On the RISC: 

The RISC can copy data at a maximum rate of one word per cycle for simple address 
computations and at a somewhat lower rate for more complex ones. 

With a memory hierarchy, the sources will be read from memory (or cache, if they
were used recently) once and written to the temporary copy, which will then reside in
the cache, too. This increases the pressure in the memory hierarchy by the amount of
memory used for the temporaries. Since temporaries are allocated on the stack
memory, which is re-used frequently, the chances are good that the dirty memory area
is re-defined before it is written back to memory. Hence the write back operation to
memory is of no concern.

o Via an XPP configuration: 

The PAE array can read and write one value from every IRAM per cycle. Thus if half 
of the IRAMs are used as inputs and half of the IRAMs are used as outputs, up to 
eight (or more, depending on the number of IRAMs) values can be reordered per 
cycle, using the PAE array for address generation. As the inputs and outputs reside in 
IRAMs, it does not matter, if the reordering is done before or after the configuration 
that uses the data - the IRAMs can be reused immediately. 



2.3 State of the XPP Core 

As described in the previous section, the size of the state is crucial for the efficiency of context
switches. However, although the size of the state is fixed for the XPP core, it depends on the
declaration of the various state elements whether they have to be saved or not.

The state of the XPP core can be classified as

1 Read only (instruction data)

■ configuration data, consisting of PAE configuration and routing configuration data

2 Read-Write

■ the contents of the data registers and latches of the PAEs, which are driven onto the busses

■ the contents of the IRAM elements

2.3.1 Limiting Memory Traffic

There are several possibilities to limit the amount of memory traffic during context switches.

Do not save read-only data

This avoids storing configuration data, since configuration data is read only. The current configuration
is simply overwritten by the new one.

Save less data

If a configuration is defined to be uninterruptible (non pre-emptive), all of the local state on the busses
and in the PAEs can be declared as scratch. This means that every configuration gets its input data
from the IRAMs and writes its output data to the IRAMs. So after the configuration has finished, all
information in the PAEs and on the busses is redundant or invalid and does not have to be saved.

Save modified data only 

To reduce the amount of R/W data, which has to be saved, we need to keep track of the modification 
state of the different entities. This incurs a silicon area penalty for the additional "dirty" bits. 

Use caching to reduce the memory traffic 

The configuration manager handles manual preloading of configurations. Preloading will help in 
parallelizing the memory transfers with other computations during the task switch. This cache can also 
reduce the memory traffic for frequent context switches, provided that a Least Recently Used (LRU) 
replacement strategy is implemented in addition to the preload mechanism. 

The IRAMs can be defined to be local cache copies of main memory. Then each IRAM is associated
with a starting address and modification state information. The IRAM memory cells should also be
replicated as for the SMT support. Then only the starting addresses of the IRAMs have to be saved and
restored as context. The starting addresses for the IRAMs of the current configuration select the IRAM
instances with identical addresses to be used.

If no address tag of an IRAM instance matches the address of the newly loaded context, the
corresponding memory area is loaded to an empty IRAM instance.

If no empty IRAM instance is available, a clean (unmodified) instance is declared empty (and hence
must be reloaded later on).

If no clean IRAM instance is available, a modified (dirty) instance is cleaned by writing its data back
to main memory. This adds a certain delay for the write back.

This delay can be avoided if a separate state machine (cache controller) tries to clean inactive IRAM
instances by using unused memory cycles to write back the IRAM instances' contents.
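
The selection order described above (matching tag, then empty, then clean, then dirty instance) can be written down as a small decision function. This is a minimal sketch, not the hardware implementation: the type and function names are hypothetical, and picking the first candidate of each kind is an assumption, since the text does not fix which of several candidates is chosen.

```c
#include <assert.h>
#include <stdint.h>

enum iram_state { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY };

struct iram_instance {
    enum iram_state state;
    uintptr_t       tag;    /* starting address of the cached memory area */
};

enum iram_action { USE_MATCH, LOAD_EMPTY, EVICT_CLEAN, WRITE_BACK_DIRTY };

/* Picks the instance to use for starting address 'addr' following the
 * order given in the text: matching tag, else empty, else clean, else
 * dirty (which needs a write back first). Stores the chosen index in
 * '*index' and returns what has to be done with it. */
static enum iram_action select_instance(const struct iram_instance *inst,
                                        int count, uintptr_t addr, int *index)
{
    int empty = -1, clean = -1, dirty = -1;
    assert(count > 0);
    for (int n = 0; n < count; n++) {
        if (inst[n].state != IRAM_EMPTY && inst[n].tag == addr) {
            *index = n;
            return USE_MATCH;
        }
        if (inst[n].state == IRAM_EMPTY && empty < 0) empty = n;
        if (inst[n].state == IRAM_CLEAN && clean < 0) clean = n;
        if (inst[n].state == IRAM_DIRTY && dirty < 0) dirty = n;
    }
    if (empty >= 0) { *index = empty; return LOAD_EMPTY; }
    if (clean >= 0) { *index = clean; return EVICT_CLEAN; }
    *index = dirty;   /* count > 0 and nothing else left: must be dirty */
    return WRITE_BACK_DIRTY;
}
```

Only the WRITE_BACK_DIRTY case incurs the write-back delay; a background cleaning state machine as described above would reduce how often that case is reached.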
Usually a processor is viewed as executing a single stream of instructions. But today's multitasking
operating systems support hundreds of tasks being executed on a single processor. This is achieved by
switching contexts, where all, or at least the most relevant parts of the processor state, which belong to
the current task - the task's context - is exchanged with the state of another task, that will be executed
next.

There are three types of context switches: switching of virtual processors with simultaneous
multithreading (SMT, also known as HyperThreading), execution of an Interrupt Service Routine
(ISR) and Task switch.

SMT Virtual Processor Switch

This type of context switch is executed without software interaction, totally in hardware. Instructions
of several instruction streams are merged into a single instruction stream to increase instruction level
parallelism and improve functional unit utilization. Hence the processor state cannot be stored to and
reloaded from memory between instructions from different instruction streams: imagine the worst case
of alternating instructions from two streams and the hundreds to thousands of cycles needed to write the
processor state to memory and read in another state.

Hence hardware designers have to replicate the internal state for every virtual processor. Every
instruction is executed within the context (on the state) of the virtual processor whose program
counter was used to fetch the instruction. By replicating the state, only the multiplexers, which have to
be inserted to select one of the different states, have to be switched.

Thus the size of the state also increases the silicon area needed to implement SMT, so the size of the
state is crucial for many design decisions.

For the design presented in the previous sections, the state is minimal, thus enabling efficient
implementations.

Interrupt Service Routine

This type of context switch is handled partially by hardware and partially by software. All of the state
modified by the ISR has to be saved on entry and must be restored on exit.

The part of the state which is destroyed by the jump to the ISR is saved by hardware (e.g. PC). It is
the ISR's responsibility to save and restore the state of all other resources that are actually used within
the ISR.

The more state information to be saved, the slower the interrupt response time will be and the greater
the performance impact will be if external events trigger interrupts at a high rate.

The execution model of the instructions will also affect the tradeoff between short interrupt latencies
and maximum throughput: throughput is maximized if the instructions in the pipeline are finished,
and the instructions of the ISR are chained. This adversely affects the interrupt latency. If, however,
the instructions are abandoned (pre-empted) in favor of a short interrupt latency, they must be fetched
again later, which affects throughput. The third possibility would be to save the internal state of the
instructions within the pipeline, but this requires too much hardware effort, so this is not done
normally.

Task Switch

This type of context switch is executed totally in software. All of a task's context (state) has to be
saved to memory, and the context of the new task has to be reloaded. Since tasks are usually allowed
to use all of the processor's resources to achieve top performance, all of the processor state has to be
saved and restored. If the amount of state is excessive, the rate of context switches must be decreased
by less frequent rescheduling, or a severe throughput degradation will result, as most of the time will
be spent in saving and restoring task contexts. This in turn increases the response time for the tasks.

2.5 Software / Hardware Interface 

According to the design parameter changes and the corresponding changes to the hardware, the
hardware / software interface has changed. In the following, the most prominent changes and their
handling will be discussed:

2.5.1 Explicit Cache 

The proposed cache is not a usual cache, which would be - not considering performance issues -
invisible to the programmer / compiler, as its operation is transparent. The proposed cache is an
explicit cache. Its state has to be maintained by software.

Cache Consistency & Pipelining of Preload / Configuration / Write back

The software is responsible for cache consistency. It is possible to have several IRAMs caching the
same, or overlapping memory areas. As long as only one of the IRAMs is written, this is perfectly OK.
Only this IRAM will be dirty and will be written back to memory. If however more than one of the
IRAMs is written, it is not defined which data will be written to memory. This is a software bug
(non-deterministic behavior).

As the execution of the configuration is overlapped with the preloads and write backs of the IRAMs, it
is possible to create preload / configuration sequences that contain data hazards. As the cache
controller and the XPP array can be seen as separate functional units, which are effectively pipelined,
these data hazards are equivalent to pipeline hazards of a normal instruction pipeline. As with any
ordinary pipeline, there are two possibilities to resolve this:

• Hardware interlocking:

Interlocking is done by the cache controller. If the cache controller detects that the tag of a
dirty or in-use item in IRAMx overlaps a memory area used for another IRAM preload, it has
to stall that preload, effectively serializing the execution of the current configuration and the
preload.

• Software interlocking:

If the cache controller does not enforce interlocking, the code generator has to insert explicit
synchronize instructions to take care of potential interlocks. Inter-procedural and
inter-modular alias and data dependency analyses can determine if this is the case, while
scheduling algorithms help to alleviate the impact of the necessary synchronization
instructions.
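
The hazard condition that both interlocking schemes have to detect is an address-range overlap between a pending preload and a dirty or in-use IRAM. A minimal sketch of that check follows; the structure layout and the function names are hypothetical, and byte-granular half-open ranges are an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct iram_tag {
    uintptr_t start;   /* starting address of the cached area */
    size_t    bytes;   /* size of the cached area */
    bool      dirty;
    bool      in_use;
};

/* True if the half-open ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* True if a preload of [start, start+bytes) must be stalled (hardware
 * interlocking) or preceded by a synchronize instruction (software
 * interlocking) because it touches a dirty or in-use IRAM area. */
static bool preload_has_hazard(const struct iram_tag *tags, int count,
                               uintptr_t start, size_t bytes)
{
    assert(count >= 0);
    for (int n = 0; n < count; n++) {
        if ((tags[n].dirty || tags[n].in_use) &&
            ranges_overlap(tags[n].start, tags[n].bytes, start, bytes))
            return true;
    }
    return false;
}
```

A cache controller would evaluate this condition at run time on the actual tags, while a code generator can only approximate it at compile time, which is why the alias and dependency analyses mentioned above matter for the software variant.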
In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use the
computation power that would be wasted otherwise.

Code Generation for the Explicit Cache

Apart from the explicit synchronization instructions issued with software interlocking, the following 
instructions have to be issued by the compiler. 

• Configuration preload instructions, preceding the IRAM preload instructions, that will be used 
by that configuration. These should be scheduled as early as possible by the instruction 
scheduler. 

• IRAM preload instructions, which, too, should be scheduled as early as possible by the 
instruction scheduler. 

• Configuration execute instructions, following the IRAM preload instructions for that 
configuration. These instructions should be scheduled between the estimated minimum and the 
estimated maximum of the cumulative latency of their preload instructions. 

Asynchronicity to Other Functional Units 

A configuration wait instruction followed by an instruction forcing a cache write back must be issued
by the compiler if an instruction of another functional unit (mainly the Ld/St unit) can access a
memory area that is potentially dirty or in-use in an IRAM. This forces a synchronization of the
instruction streams and the cache contents, avoiding data hazards. A thorough inter-procedural and
inter-modular array alias analysis limits the frequency of these synchronization instructions to an
acceptable level.
3 Program Optimizations



3.1 Code Analysis 

In this section we describe the analyses that can be performed on programs. These analyses are then 
used by different optimizations. They describe the relationships between data and memory locations in 
the program. More details can be found in several books [2,3,5].

3.1.1 Data-Flow Analysis

Data-flow analysis examines the flow of scalar values through a program, to provide information
about how the program manipulates its data. This information can be represented by data-flow
equations that have the following general form for object i, that can be an instruction or a basic block,
depending on the problem to solve:

$$Ex[i] = Prod[i] \cup (In[i] - Supp[i])$$

It means that data available at the end of the execution of object i, Ex[i], are either produced by i,
Prod[i], or were alive at the beginning of i, In[i], but were not deleted during the execution of i,
Supp[i].
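
When the sets are represented as bit vectors, the equation becomes a single expression. The sketch below is an illustration only; encoding each definition as one bit of a 32-bit word and the name dataflow_exit are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t defset;   /* bit k set: definition k is in the set */

/* Ex[i] = Prod[i] U (In[i] - Supp[i]) evaluated on bit sets. */
static defset dataflow_exit(defset in, defset prod, defset supp)
{
    return prod | (in & ~supp);
}
```

For a block that produces one definition of x and suppresses another definition of the same variable, the result contains only the produced definition; for a block that defines nothing, the exit set equals the entry set.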

These equations can be used to solve several problems like: 

■ the problem of reaching definitions,

■ the Def-Use and Use-Def chains, describing respectively for a definition, all uses that can be
reached from it, and for a use all definitions that can reach it,

■ the available expressions at a point in the program,

■ the live variables at a point in the program,

whose solutions are then used by several compilation phases, analyses, or optimizations.

As an example let us take the problem of computing the Def-Use chains of the variables of a program.
This information can be used for instance by the data dependence analysis for scalar variables or by
the register allocation. A Def-Use chain is associated to each definition of a variable and is the set of
all visible uses from this definition. The data-flow equations presented above are applied to the basic
blocks to detect the variables that are passed from one block to another along the control-flow graph.
In the figure below, two definitions for variable x are produced: S1 in B1 and S4 in B3. Hence the
variable that can be found at the exit of B1 is Ex(B1) = {x(S1)}, and at the exit of B4 is Ex(B4) = {x(S4)}.
Moreover we have Ex(B2) = Ex(B1) as no variable is defined in B2. Using these sets, we find that the
uses of x in S2 and S3 depend on the definition of x in B1, and that the use of x in S5 depends on the
definitions of x in B1 and B3. The Def-Use chains associated with the definitions are then
D(S1) = {S2,S3,S5} and D(S4) = {S5}.

Figure 6: Control-flow graph of a piece of program



3.1.2 Data Dependence Analysis

A data dependence graph represents the dependences existing between operations writing or reading 
the same data. This graph is used for optimizations like scheduling, or certain loop optimizations to 
test their semantic validity. The nodes of the graph represent the instructions, and the edges represent 
the data dependences. These dependences can be of three types: true (or flow) dependence when a 
variable is written before being read, anti-dependence when a variable is read before being written, 
and output dependence when a variable is written twice. Here is a more formal definition [3]. 

Definition

Let S and S' be two statements; then S' depends on S, noted S δ S', iff:

(1) S is executed before S'

(2) $\exists v \in VAR:\; v \in DEF(S) \cap USE(S') \;\lor\; v \in USE(S) \cap DEF(S') \;\lor\; v \in DEF(S) \cap DEF(S')$

(3) There is no statement T such that S is executed before T and T is executed before S', and
$v \in DEF(T)$

Where VAR is the set of the variables of the program, DEF(S) is the set of the variables defined by
instruction S, and USE(S) is the set of variables used by instruction S.
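
Condition (2) maps directly onto set intersections, and each of its three terms corresponds to one dependence type. The following sketch represents DEF and USE as bit sets; this encoding and all names in it are assumptions for illustration, and conditions (1) and (3) are left to the caller.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t varset;   /* bit k set: variable k is in the set */

enum { DEP_NONE = 0, DEP_TRUE = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* Evaluates condition (2) for a statement S executed before S'.
 * Returns a bit mask of the dependence types found. Conditions (1)
 * and (3) are assumed to have been checked by the caller. */
static int classify_dependence(varset def_s, varset use_s,
                               varset def_s2, varset use_s2)
{
    int kinds = DEP_NONE;
    if (def_s & use_s2) kinds |= DEP_TRUE;    /* written, then read  */
    if (use_s & def_s2) kinds |= DEP_ANTI;    /* read, then written  */
    if (def_s & def_s2) kinds |= DEP_OUTPUT;  /* written twice       */
    return kinds;
}
```

Treating each array as a single variable is a coarse approximation; a real dependence test would also compare the subscript expressions.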

Moreover if the statements are in a loop, a dependence can be loop-independent or loop-carried. This
notion introduces the definition of the distance of a dependence. When a dependence is
loop-independent it means that it occurs between two instances of different statements in the same iteration,
and then its distance is equal to 0. On the contrary when a dependence occurs between two instances in
two different iterations the dependence is loop-carried, and the distance is equal to the difference
between the iteration numbers of the two instances.

The notion of direction of dependence generalizes the notion of distance, and is generally used when
the distance of a dependence is not constant, or cannot be computed with precision. The direction of a
dependence is given by < if the dependence between S and S' occurs when the instance of S is in an
iteration before the iteration of the instance of S', = if the two instances are in the same iteration, and >
if the instance of S is in an iteration after the iteration of the instance of S'.

In the case of a loop nest, we have then distance and direction vectors, with one element for each level
of the loop nest. The figures below illustrate all these definitions. The data dependence graph is used
by a lot of optimizations, and is also useful to determine if their application is valid. For instance a
loop can be vectorized if its data dependence graph does not contain any cycle.



for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   c[i] = a[i] + 2;
}

Figure 7: Example of a true dependence with distance 0 on array a

for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   b[i] = c[i] + 2;
}

Figure 8: Example of an anti-dependence with distance 0 on array b

for(i=0; i<N; i=i+1) {
S:    a[i] = b[i] + 1;
S1:   a[i] = c[i] + 2;
}

Figure 9: Example of an output dependence with distance 0 on array a

for(i=0; i<=N; i++)
{
S1:   c[i][j] = 0;
      for(k=0; k<=N; k++)
S2:     c[i][j] = c[i][j] + a[i][k]*b[k][j];
}

Figure 10: Example of a dependence with direction vector (=,=)
between S1 and S2 and a dependence with direction vector [...]
between S2 and S2.



for(i=0; i<=N; i++)
S:    a[i][j] = a[i][j+2] + [...]

Figure 11: Example of an anti-dependence with distance vector (0,2).



3.1.3 Interprocedural Alias Analysis 



The aim of alias analysis is to determine if a memory location is aliased by several objects, like
variables or arrays, in a program. It has a strong impact on data dependence analysis and on the
application of code optimizations. Aliases can occur with statically allocated data, like unions in C
where all fields refer to the same memory area, or with dynamically allocated data, which are the usual
targets of the analysis. In Figure 12, we have a typical case of aliasing where p aliases b.

int b[100], *p;

for(p=b; p < &b[100]; p++)
  *p = 0;

Figure 12: Example for typical aliasing

Alias analysis can be more or less precise depending on whether or not it takes the control-flow into
account. When it does, it is called flow-sensitive, and when it does not, it is called flow-insensitive.
Flow-sensitive alias analysis is able to detect in which blocks along a path two objects are aliased. As
it is more precise, it is more complicated and more expensive to compute. Usually flow-insensitive
alias information is sufficient. This aspect is illustrated in Figure 13 where a flow-insensitive analysis
would find that p alias b, but where a flow-sensitive analysis would be able to find that p alias b only
in block B2.

Furthermore aliases are classified into must-aliases and may-aliases. For instance, if we consider
flow-insensitive may-alias information, then x alias y, iff x and y may, possibly at different times, refer to
the same memory location. And if we consider flow-insensitive must-alias information, x alias y, iff x
and y must, throughout the execution of a procedure, refer to the same storage location. In the case of
Figure 13, if we consider flow-insensitive may-alias information, p alias b holds, whereas if we
consider flow-insensitive must-alias information, p alias b does not hold. The kind of information to
use depends on the problem to solve. For instance, if we want to remove redundant expressions or
statements, must-aliases must be used, whereas if we want to build a data dependence graph
may-aliases are necessary.

[Control-flow graph; recoverable block contents: "<uses of b and p>", "p = malloc();",
"<uses of b and p>"; one block is labelled B3.]

Figure 13: Example of control-flow sensitivity



Finally this analysis must be interprocedural to be able to detect aliases caused by non-local variables
and parameter passing. The latter case is depicted in Figure 14 where i and j are aliased through the
function call where k is passed twice as parameter.

void foo(int *i, int *j)
{
  *i = *j+1;
}

foo(&k, &k);

Figure 14: Example for aliasing by parameter passing



3.1.4 Interprocedural Value Range Analysis

This analysis can find the range of values taken by the variables. It can help to apply optimizations like
dead code elimination, loop unrolling and others. For this purpose it can use information on the types
of variables and then consider operations applied on these variables during the execution of the
program. Thus it can determine for instance if tests in conditional instructions are likely to be met or
not, or determine the iteration range of loop nests.

This analysis has to be interprocedural as for instance loop bounds can be passed as parameters of a
function, like in the following example. We know by analyzing the code that in the loop executed with
array a, N is at least equal to 11, and that in the loop executed with array b, N is at most equal to 10.

void foo(int *c, int N)
{
  int i;
  for (i=0; i<N; i++)
    c[i] = g(i,2);
}

if (N > 10)
  foo(a,N);
else
  foo(b,N);

The value range analysis can be supported by the programmer by giving further value constraints
which cannot be retrieved from the language semantics. This can be done by pragmas or a
compiler-known assert function.
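
As an illustration of such a programmer-supplied constraint, the sketch below uses the standard C assert macro as a stand-in. Whether and how a particular compiler exploits the asserted range is compiler-specific; the function name and the bound of 64 are hypothetical.

```c
#include <assert.h>

/* Sums the first n elements of v. The assert documents a constraint
 * (1 <= n <= 64) that cannot be derived from the language semantics. */
static int sum_prefix(const int *v, int n)
{
    int s = 0;
    assert(n >= 1 && n <= 64);   /* value range hint: loop runs 1..64 times */
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

With the constraint in place, an analysis could conclude that the loop body executes at least once and at most 64 times, for instance to size a buffer for the data it touches.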

3.1.5 Alignment Analysis

Alignment analysis deals with data layout for distributed memory architectures. As stated by Saman
Amarasinghe: "Although data memory is logically a linear array of cells, its realization in hardware
can be viewed as a multi-dimensional array. Given a dimension in this array, alignment analysis will
identify memory locations that always resolve to a single value in that dimension. For example, if the
dimension of interest is memory banks, alignment analysis will identify if a memory reference always
accesses the same bank". This is the case in the second part of the figure below, that can be found in
[10], where all accesses, depicted in blue, occur to the same memory bank, whereas in the first part,
the accesses are not aligned. He adds then that: "Alignment information is useful in a variety of
compiler-controlled memory optimizations leading to improvements in programmability, performance,
and energy consumption."

Alignment analysis, for instance, is able to help find a good distribution scheme of the data and is
furthermore useful for automatic data distribution tools. An automatic alignment analysis tool can
automatically generate alignment proposals for the arrays accessed in a procedure and thus
simplifies the data distribution problem. This can be extended with an interprocedural analysis taking
into account dynamic realignment.

Alignment analysis can also be used to apply loop alignment that transforms the code directly rather
than the data layout in itself, as shown later. Another solution can be used for the PACT XPP, relying
on the fact that it can handle aligned code very efficiently. It consists in adding a conditional
instruction testing if the accesses in the loop body are aligned, followed by the necessary number of
peeled iterations of the loop body, then the aligned loop body, and then some compensation code.
Only the aligned code is then executed by the PACT XPP; the rest is executed by the host processor. If
the alignment analysis is more precise (inter-procedural or inter-modular), less conditional code has to
be inserted.
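
The structure of such code (alignment test, peeled iterations, aligned body, compensation code) can be sketched in plain C. This is only an illustration: the 16-byte alignment, the function name and the choice of a simple "add a constant" loop body are assumptions, and everything runs on the host here rather than on an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 16   /* assumed alignment required by the fast path */

/* Adds c to every element of a[0..n-1]. Returns the number of elements
 * handled by the aligned middle loop (the part a code generator could
 * hand to the array); the rest runs as ordinary scalar code. */
static size_t add_const_peeled(int32_t *a, size_t n, int32_t c)
{
    const size_t per_block = ALIGN_BYTES / sizeof(int32_t);
    size_t i = 0;
    assert(a != NULL || n == 0);

    /* Peeled iterations: run until a[i] sits on an aligned address. */
    while (i < n && ((uintptr_t)&a[i] % ALIGN_BYTES) != 0) {
        a[i] += c;
        i++;
    }

    /* Aligned loop body: whole blocks starting at an aligned address. */
    size_t aligned_end = i + ((n - i) / per_block) * per_block;
    size_t aligned_count = aligned_end - i;
    for (; i < aligned_end; i++)
        a[i] += c;

    /* Compensation code: leftover elements after the last full block. */
    for (; i < n; i++)
        a[i] += c;

    return aligned_count;
}
```

If the analysis can prove at compile time that the array is aligned and that n is a multiple of the block size, the peeled and compensation loops never execute and could be omitted.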

3.2 Code Optimizations 

Most of the optimizations and transformations presented here can be found in detail in [4], and also in
[2,3,5].

3.2.1 General Transformations

We present in this section a few general optimizations that can be applied to straightforward code, and
to loop bodies. These are not the only ones that appear in a compiler, but they are mentioned in the
sequel of this document.

Constant Propagation

This optimization propagates the values of constants into the expressions using them throughout the
program. This way a lot of computations can be done statically by the compiler, leaving less work to
be done during the execution. This part of the optimization is also known as constant folding.

N = 256;                       for(i=0; i<=256; i++)
c = 3;                           a[i] = b[i] + 3;
for(i=0; i<=N; i++)
  a[i] = b[i] + c;

Figure 15: Example of constant propagation

Copy Propagation 

This optimization simplifies the code by removing redundant copies of the same variable in the code.
These copies can be produced by the programmer or by other optimizations. This optimization
reduces the register pressure and the number of register-to-register move instructions.

t = i*4;                       t = i*4;
r = t;                         for(i=0; i<=N; i++)
for(i=0; i<=N; i++)              a[t] = b[t] + a[i];
  a[r] = b[r] + a[i];

Figure 16: Example of copy propagation

Dead Code Elimination 

This optimization removes pieces of code that will never be executed. Code is never executed if it is in
the branch of a conditional statement whose condition is always evaluated to true or false, or if it is a
loop body whose number of iterations is always equal to 0.

Code updating variables that are never used is also useless and can be removed as well. If a variable
is never used, then the code updating it and its declaration can also be eliminated.

for(i=0; i<=N; i++) {          for(i=0; i<=N; i++) {
  for(j=0; j<0; j++)             for(j=0; j<10; j++)
    a[j] = b[j] + a[i];            a[j+1] = a[j] + b[j];
  for(j=0; j<10; j++)          }
    a[j+1] = a[j] + b[j];
}

Figure 17: Example of dead code elimination

Forward Substitution 

This optimization is a generalization of copy propagation. The use of a variable is replaced by its
defining expression. It can be used for simplifying the data dependency analysis and the application of
other transformations by making the use of loop variables visible.

[...]                          for(i=0; i<=N; i++)
                                 a[N+i] = b[N+i] + [...]

Figure 18: Example of forward substitution

Idiom Recognition 

This transformation recognizes pieces of code and can replace them by calls to compiler-known
functions, or less expensive code sequences, like code for absolute value computation.

for(i=0; i<N; i++) {           for(i=0; i<N; i++) {
  if (c<0)                       c = abs(c);
    c = -c;                      a[i] = c;
  a[i] = c;                    }
}

Figure 19: Example of idiom recognition



3.2.2 Loop Transformations

Loop Normalization

This transformation ensures that the iteration space of the loop always has a lower bound equal to 0
or 1 (depending on the input language), and a step of 1. The array subscript expressions and the
bounds of the loops are modified accordingly. It can be used, for example, before loop fusion.

for(i=2; i<N; i=i+2)           for(i=0; i<(N-2)/2; i++)
  a[i] = b[i];                   a[2*i+2] = b[2*i+2];

Figure 20: Example of loop normalization

Loop Reversal

This transformation changes the direction in which the iteration space of a loop is scanned. It is usually
used in conjunction with loop normalization and other transformations, like loop interchange, because
it changes the dependence vectors.

for(i=N; i>=0; i--)            for(i=0; i<=N; i++)
  a[i] = b[i];                   a[i] = b[i];

Figure 21: Example of loop reversal

Strength Reduction

This transformation replaces expressions in the loop body by equivalent but less expensive ones. It can
be used on induction variables, other than the loop variable, to be able to eliminate them.

for(i=0; i<N; i++)             t = 0;
  a[i] = b[i] + c*i;           for(i=0; i<N; i++) {
                                 a[i] = b[i] + t;
                                 t = t + c;
                               }

Figure 22: Example of strength reduction

Induction Variable Elimination 

This transformation can use strength reduction to remove induction variables from a loop, hence
reducing the number of computations and easing the analysis of the loop. It also removes
dependence cycles due to the update of the variable, enabling vectorization.

for(i=0; i<=N; i++) {          for(i=0; i<=N; i++) {
  [...]                          [...]
}                              }
                               k = k + (N+1)*3;

Figure 23: Example of induction variable elimination
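
A minimal sketch of the idea follows, with a hypothetical loop body chosen for illustration (it is not taken from Figure 23). The induction variable k advances by 3 per iteration, so its value in iteration i is the initial value plus 3*(i+1), and a single update after the loop restores its final value.

```c
#include <assert.h>

#define N 9   /* loop runs for i = 0..N */

/* Original form: k is updated in every iteration, which creates a
 * loop-carried dependence on k. Returns the final value of k. */
static int with_induction_variable(int *a, const int *b, int k)
{
    for (int i = 0; i <= N; i++) {
        k = k + 3;
        a[i] = b[i] + k;
    }
    return k;
}

/* Transformed form: uses of k are rewritten in terms of i, and k is
 * updated once after the loop. */
static int without_induction_variable(int *a, const int *b, int k)
{
    for (int i = 0; i <= N; i++)
        a[i] = b[i] + (k + (i + 1) * 3);
    k = k + (N + 1) * 3;
    return k;
}
```

In the second form no iteration depends on a value computed by a previous iteration, so the loop body can be evaluated for all i independently.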

Loop-Invariant Code Motion

This transformation moves computations outside a loop if their result is the same in all iterations. This
makes it possible to reduce the number of computations in the loop body. This optimization can also
be conducted in the reverse fashion in order to get perfectly nested loops.

for(i=0; i<N; i++)             if (N >= 0)
  a[i] = b[i] + x*y;             c = x*y;
                               for(i=0; i<N; i++)
                                 a[i] = b[i] + c;

Figure 24: Example of loop-invariant code motion

Loop Unswitching

This transformation moves a conditional instruction outside of a loop body if its condition is
loop-invariant. The branches of the condition are then made of the original loop with the appropriate
original statements of the conditional statement. It allows further parallelization of the loop by
removing control-flow in the loop body and also removing unnecessary computations from it.

for(i=0; i<N; i++) {
  a[i] = b[i] + 3;
  if (x > 2)
    b[i] = c[i] + 2;
  else
    b[i] = c[i] - 2;
}

if (x > 2)
  for(i=0; i<N; i++) {
    a[i] = b[i] + 3;
    b[i] = c[i] + 2;
  }
else
  for(i=0; i<N; i++) {
    a[i] = b[i] + 3;
    b[i] = c[i] - 2;
  }

Figure 25: Example of loop unswitching



If-Conversion 

This transformation is applied on loop bodies with conditional instructions. It changes control
dependences into data dependences and thereby allows vectorization to take place. It can be used in
conjunction with loop unswitching to handle loop bodies with several basic blocks. The conditions,
where array expressions could appear, are replaced by boolean terms called guards. Processors with
predicated execution support can execute such code directly.



for(i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
    if (a[i] != 0)
        if (a[i] > c[i])
            a[i] = a[i] - 2;
        else
            a[i] = a[i] + 1;
    d[i] = a[i] * 2;
}

for(i = 0; i < N; i++) {
    a[i] = a[i] + b[i];
    c2 = (a[i] != 0);
    if (c2) c4 = (a[i] > c[i]);
    if (c2 && c4) a[i] = a[i] - 2;
    if (c2 && !c4) a[i] = a[i] + 1;
    d[i] = a[i] * 2;
}

Figure 26: Example of if-conversion
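
The guard idea can be tried out in plain C. The following is a sketch under assumptions, not the
form a predicating compiler would emit: the function names are hypothetical, and the guards are
turned into 0/1 integers that select the update arithmetically, so that the loop body contains no
conditional statement.

```c
#include <assert.h>
#include <stddef.h>

/* Original control flow: nested branches inside the loop body. */
void body_branching(int *a, const int *b, const int *c, int *d, size_t n)
{
    assert(n == 0 || (a && b && c && d));
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0) {
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        }
        d[i] = a[i] * 2;
    }
}

/* If-converted form: the guards c2 and c4 are data values (0 or 1). */
void body_guarded(int *a, const int *b, const int *c, int *d, size_t n)
{
    assert(n == 0 || (a && b && c && d));
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        int c2 = (a[i] != 0);
        int c4 = c2 & (a[i] > c[i]);
        int dec = c2 & c4;          /* guard of "a[i] = a[i] - 2" */
        int inc = c2 & !c4;         /* guard of "a[i] = a[i] + 1" */
        a[i] = a[i] - 2 * dec + inc;
        d[i] = a[i] * 2;
    }
}
```

Since at most one of the two guards dec and inc is set, the guarded body performs the same update
as the branching one.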

Strip-Mining 

This transformation allows the granularity of an operation to be adjusted. It is commonly used to
choose the number of independent computations in the inner loop nest. When the iteration count is not
known at compile time, it can be used to generate a fixed iteration count inner loop satisfying the
resource constraints. It can be used in conjunction with other transformations like loop distribution
or loop interchange. It is also called loop sectioning. Cycle shrinking, also called stripping, is a
specialization of strip-mining.



for(i=0; i<N; i++)              up = (N/16)*16;
    a[i] = b[i] + c;            for(i=0; i<up; i = i + 16)
                                    a[i:i+15] = b[i:i+15] + c;
                                for(j=up; j<N; j++)
                                    a[j] = b[j] + c;

Figure 27: Example of strip-mining
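
Since the array-section notation is not C, a runnable equivalent may help. The sketch below is an
illustration only: the function name and the constant STRIP are hypothetical, and the vector
statement is written as an inner loop with a fixed iteration count of 16.

```c
#include <assert.h>
#include <stddef.h>

enum { STRIP = 16 };                /* assumed strip length */

void add_const_strip_mined(int *a, const int *b, int c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t up = (n / STRIP) * STRIP;    /* largest multiple of STRIP <= n */

    /* Full strips: the inner loop always runs exactly STRIP times. */
    for (size_t i = 0; i < up; i += STRIP)
        for (size_t j = i; j < i + STRIP; j++)
            a[j] = b[j] + c;

    /* Remainder loop: fewer than STRIP iterations are left. */
    for (size_t j = up; j < n; j++)
        a[j] = b[j] + c;
}
```

The inner loop is the part that maps to a fixed-size resource; the remainder loop handles the
iterations that do not fill a complete strip.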
Loop Tiling

This transformation modifies the iteration space of a loop nest by introducing loop levels to divide the 
iteration space in tiles. It is a multi-dimensional generalization of strip-mining. It is generally used to 
improve memory reuse, but can also improve processor, register, TLB, or page locality. It is also 
called loop blocking. 

The size of the tiles of the iteration space is chosen so that the data needed in each tile fit in the
cache memory, thus reducing the cache misses. In the case of coarse-grain computers, the size of the
tiles can also be chosen so that the number of parallel operations of the loop body fits the number of
processors of the computer.

for(i=0; i<N; i++)              for(ii=0; ii<N; ii = ii+16)
    for(j=0; j<N; j++)              for(jj=0; jj<N; jj = jj+16)
        a[i][j] = b[j][i];              for(i=ii; i< min(ii+16,N); i++)
                                            for(j=jj; j< min(jj+16,N); j++)
                                                a[i][j] = b[j][i];

Figure 28: Example of loop tiling
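
A compilable variant of the tiled transposition is sketched below. It is an illustration under
assumptions: the matrices are stored row-major in flat arrays, and the names transpose_tiled, min_sz
and TILE are hypothetical. Each tile covers the half-open index range [ii, ii+TILE), clipped to n.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 };                 /* assumed tile edge length */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Transposes the n x n row-major matrix b into a, tile by tile. */
void transpose_tiled(int *a, const int *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            /* Work inside one TILE x TILE block of the iteration space. */
            for (size_t i = ii; i < min_sz(ii + TILE, n); i++)
                for (size_t j = jj; j < min_sz(jj + TILE, n); j++)
                    a[i * n + j] = b[j * n + i];
}
```

The result is identical to the untiled loop nest; only the order in which the elements are visited
changes, which is what improves reuse of cached data.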

Loop Interchange 

This transformation is applied to a loop nest to move inside or outside (depending on the desired
effect) the loop level containing data dependences. It can:

■ enable vectorization by moving inside an independent loop and outside a dependent loop, or

■ improve vectorization by moving inside the independent loop with the largest range, or

■ reduce the stride, or

■ increase the number of loop-invariant expressions in the inner loop, or

■ improve parallel performance by moving an independent loop outside of a loop nest to increase the
granularity of each iteration and reduce the number of barrier synchronizations.

for(i=0; i<N; i++)                  for(j=0; j<N; j++)
    for(j=0; j<N; j++)                  for(i=0; i<N; i++)
        a[i] = a[i] + b[i][j];              a[i] = a[i] + b[i][j];

Figure 29: Example of loop interchange

Loop Coalescing / Collapsing 

This transformation combines a loop nest into a single loop. It can improve the scheduling of the loop
and also reduces the loop overhead. Collapsing is a simpler version of coalescing in which the number
of dimensions of arrays is reduced as well. Collapsing reduces the overhead of nested loops and multi-
dimensional arrays. Collapsing can be applied to loop nests that iterate over memory with a constant
stride, otherwise loop coalescing is a better approach. It can be used to make vectorizing profitable by
increasing the iteration range of the innermost loop.

for(i=0; i<N; i++)                  for(k=0; k<N*M; k++) {
    for(j=0; j<M; j++)                  i = k / M;
        a[i][j] = a[i][j] + c;          j = k % M;
                                        a[i][j] = a[i][j] + c;
                                    }

Figure 30: Example of loop coalescing
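
The recovery of the original indices from the single loop counter is the only delicate point. The
following C sketch, with hypothetical function names and a flat row-major array, assumes zero-based
indices, for which the quotient k / m gives the row and the remainder k % m gives the column.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop nest over an n x m row-major array. */
void add_const_nested(int *a, size_t n, size_t m, int c)
{
    assert(n * m == 0 || a != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i * m + j] += c;
}

/* Coalesced form: one loop of n*m iterations, indices recomputed from k. */
void add_const_coalesced(int *a, size_t n, size_t m, int c)
{
    assert(n * m == 0 || a != NULL);
    for (size_t k = 0; k < n * m; k++) {
        size_t i = k / m;           /* row index */
        size_t j = k % m;           /* column index */
        a[i * m + j] += c;
    }
}
```

For a contiguous array, i * m + j equals k, so the index computation could be dropped altogether;
that simplification corresponds to collapsing.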

Loop Fusion

This transformation, also called loop jamming, merges 2 successive loops. It reduces loop overhead, 
increases instruction-level parallelism, improves register, cache, TLB or page locality, and improves 
the load balance of parallel loops. Alignment can be taken into account by introducing conditional 
instructions to take care of dependences. 

for(i=0; i<N; i++)              for(i=0; i<N; i++) {
    a[i] = b[i] + c;                a[i] = b[i] + c;
                                    d[i] = e[i] + c;
for(i=0; i<N; i++)              }
    d[i] = e[i] + c;

Figure 31: Example of loop fusion



Loop Distribution 

This transformation, also called loop fission, allows a loop to be split into several pieces in case
the loop body is too big, or because of dependences. The iteration space of the new loops is the same
as the iteration space of the original loop. Loop spreading is a more sophisticated distribution.

for(i=0; i<N; i++) {            for(i=0; i<N; i++)
    a[i] = b[i] + c;                a[i] = b[i] + c;
    d[i] = e[i] + c;
}                               for(i=0; i<N; i++)
                                    d[i] = e[i] + c;

Figure 32: Example of loop distribution

Loop Unrolling / Unroll-and-Jam

This transformation replicates the original loop body in order to get a larger one. A loop can be
unrolled partially or completely. It is used to get more opportunity for parallelization by making the
loop body bigger. It also improves register or cache usage and reduces loop overhead. Unrolling
the outer loop followed by merging the induced inner loops is referred to as unroll-and-jam.

for(i=0; i<N; i++)              for(i=0; i<N-1; i = i+2) {
    a[i] = b[i] + c;                a[i] = b[i] + c;
                                    a[i+1] = b[i+1] + c;
                                }
                                if ((N%2) == 1)
                                    a[N-1] = b[N-1] + c;

Figure 33: Example of loop unrolling
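
The handling of an odd iteration count is easy to get wrong, so a runnable sketch may be useful. The
function name below is hypothetical. The unrolled loop only runs while two elements are left, and a
single clean-up statement handles the last element when n is odd.

```c
#include <assert.h>
#include <stddef.h>

/* Loop unrolled by a factor of 2, with clean-up code for odd n. */
void add_const_unrolled2(int *a, const int *b, int c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = 0;
    for (; i + 1 < n; i += 2) {     /* both i and i+1 are valid indices */
        a[i] = b[i] + c;
        a[i + 1] = b[i + 1] + c;
    }
    if (i < n)                      /* one element left: n is odd */
        a[i] = b[i] + c;
}
```

Writing the loop condition as i + 1 < n avoids both an out-of-bounds access for odd n and an
unsigned underflow for n equal to 0.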

Loop Alignment 

This optimization transforms the code to get aligned array accesses in the loop body. Its effect is to
transform loop-carried dependences into loop-independent dependences, which allows more parallelism
to be extracted from a loop. It can use different transformations, like loop peeling or the
introduction of conditional statements, to achieve its goal. This transformation can be used in
conjunction with loop fusion to enable this optimization by aligning the array accesses in both loop
nests. In the example below, all accesses to array a become aligned.

for(i=2; i<=N; i++) {               for(i=1; i<=N; i++) {
    a[i] = b[i] + c[i];                 if (i>1) a[i] = b[i] + c[i];
    d[i] = a[i-1] * 2;                  if (i<N) d[i+1] = a[i] * 2;
    e[i] = a[i-1] + d[i+1];             if (i<N) e[i+1] = a[i] + d[i+2];
}                                   }

Figure 34: Example of loop alignment

Loop Skewing 

This transformation is used to enable parallelization of a loop nest. It is useful in combination with
loop interchange. It is performed by adding the outer loop index multiplied by a skew factor f to the
bounds of the inner loop variable, and then subtracting the same quantity from every use of the inner
loop variable inside the loop.

for(i=1; i<=N; i++)                 for(i=1; i<=N; i++)
    for(j=1; j<=N; j++)                 for(j=i+1; j<=i+N; j++)
        a[i] = a[i+j] + c;                  a[i] = a[j] + c;

Figure 35: Example of loop skewing
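
With a skew factor of 1, the inner index j is replaced by j + i, so that a use of i + j becomes a use
of the new inner index alone. The C sketch below, with hypothetical function names, states both loop
nests so that they can be compared; it assumes that the array holds at least 2*n + 1 elements.

```c
#include <assert.h>

/* Original nest: reads a[i+j], writes a[i]; a must hold 2*n + 1 elements. */
void skew_original(int *a, int n, int c)
{
    assert(a != 0 && n >= 0);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            a[i] = a[i + j] + c;
}

/* Skewed nest (factor 1): inner bounds shifted by i, i subtracted in the use. */
void skew_skewed(int *a, int n, int c)
{
    assert(a != 0 && n >= 0);
    for (int i = 1; i <= n; i++)
        for (int j = i + 1; j <= i + n; j++)
            a[i] = a[j] + c;        /* a[(j - i) + i] == a[j] */
}
```

Both nests execute the same iterations in the same order; skewing only changes the shape of the
iteration space, which is what makes a subsequent loop interchange useful.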

Loop Peeling 

This transformation removes a small number of beginning or ending iterations of a loop to avoid 
dependences in the loop body. These removed iterations are executed separately. It can be used for 
matching the iteration control of adjacent loops to enable loop fusion. 

for(i=0; i<=N; i++)                     a[0][N] = a[0][N] + a[N][N];
    a[i][N] = a[0][N] + a[N][N];        for(i=1; i<=N-1; i++)
                                            a[i][N] = a[0][N] + a[N][N];
                                        a[N][N] = a[0][N] + a[N][N];

Figure 36: Example of loop peeling

Loop Splitting 

This transformation cuts the iteration space in pieces by creating other loop nests. It is also called
index set splitting, and is generally used because of dependences that prevent parallelization. The
iteration space of the new loops is a subset of the original one. It can be seen as a generalization of
loop peeling.

for(i=0; i<=N; i++)             for(i=0; i<(N+1)/2; i++)
    a[i] = a[N-i+1] + c;            a[i] = a[N-i+1] + c;
                                for(i=(N+1)/2; i<=N; i++)
                                    a[i] = a[N-i+1] + c;

Figure 37: Example of loop splitting

Node Splitting 

This transformation splits a statement in pieces. It is used to break dependence cycles in the
dependence graph due to the too high granularity of the nodes, thus enabling vectorization of the
statements.

for(i=0; i<N; i++) {                    for(i=0; i<N; i++) {
    b[i] = a[i] + c[i] * d[i];              t1[i] = c[i] * d[i];
    a[i+1] = b[i] * (d[i] - c[i]);          t2[i] = d[i] - c[i];
}                                           b[i] = a[i] + t1[i];
                                            a[i+1] = b[i] * t2[i];
                                        }

Figure 38: Example of node splitting

Scalar Expansion 

This transformation replaces a scalar in a loop by an array to eliminate dependences in the loop body
and enable parallelization of the loop nest. If the scalar is used after the loop, compensation code
must be added.

for(i=0; i<N; i++) {            for(i=0; i<N; i++) {
    c = b[i];                       tmp[i] = b[i];
    a[i] = a[i] + c;                a[i] = a[i] + tmp[i];
}                               }
                                c = tmp[N-1];

Figure 39: Example of scalar expansion

Array Contraction / Array Shrinking

This transformation is the reverse transformation of scalar expansion. It may be needed if scalar 
expansion generates too many memory requirements. 

for(i=0; i<N; i++)                      for(i=0; i<N; i++)
    for(j=0; j<N; j++) {                    for(j=0; j<N; j++) {
        t[i][j] = a[i][j] * 3;                  t[j] = a[i][j] * 3;
        b[i][j] = t[i][j] + c[j];               b[i][j] = t[j] + c[j];
    }                                       }

Figure 40: Example of array contraction

Scalar Replacement 

This transformation replaces an invariant array reference in a loop by a scalar. This array element is
loaded in a scalar before the inner loop and stored again after the inner loop, if it is modified. It
can be used in conjunction with loop interchange.

for(i=0; i<N; i++)                  for(i=0; i<N; i++) {
    for(j=0; j<N; j++)                  tmp = a[i];
        a[i] = a[i] + b[i][j];          for(j=0; j<N; j++)
                                            tmp = tmp + b[i][j];
                                        a[i] = tmp;
                                    }

Figure 41: Example of scalar replacement

Reduction Recognition 

This transformation allows reductions in loops to be handled. A reduction is an operation that computes
a scalar value from arrays. It can be a dot product, or the sum or minimum of a vector, for instance.
The goal is then to perform as many operations in parallel as possible. One way is to accumulate a
vector register of partial results and then reduce it to a scalar with a sequential loop. Maximum
parallelism is then achieved by reducing the vector register with a tree: pairs of elements are summed,
then pairs of these results are summed, etc.

for(i=0; i<N; i++)              for(i=0; i<N; i=i+64)
    s = s + a[i];                   tmp[0:63] = tmp[0:63] + a[i:i+63];
                                for(i=0; i<64; i++)
                                    s = s + tmp[i];

Figure 42: Example of reduction recognition
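
Both steps, the accumulation of partial results and the tree-shaped final reduction, can be written
out in C. The sketch below is an illustration under assumptions: LANES stands for the vector width
(8 here instead of 64, and required to be a power of two), the function name is hypothetical, and
leftover elements are simply added to the first partial sum.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };                 /* assumed vector width, a power of two */

long sum_reduce(const int *a, size_t n)
{
    assert(n == 0 || a != NULL);
    long tmp[LANES] = {0};          /* vector of partial sums */
    size_t up = (n / LANES) * LANES;

    /* Each lane accumulates every LANES-th element independently. */
    for (size_t i = 0; i < up; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            tmp[l] += a[i + l];

    /* Elements that do not fill a complete vector. */
    for (size_t i = up; i < n; i++)
        tmp[0] += a[i];

    /* Tree reduction: halve the number of live partial sums per step. */
    for (size_t width = LANES / 2; width > 0; width /= 2)
        for (size_t l = 0; l < width; l++)
            tmp[l] += tmp[l + width];

    return tmp[0];
}
```

The tree needs log2(LANES) steps, and the additions within one step are independent of each other,
whereas the sequential loop of the figure needs one step per partial sum.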

Loop Pushing / Loop Embedding 

This transformation replaces a call in a loop body by the loop in the called function. It is an inter-
procedural optimization. It allows the parallelization of the loop nest and eliminates the overhead
caused by the procedure call. Loop distribution can be used in conjunction with loop pushing.

for(i=0; i<N; i++)              f2(x);
    f(x,i);
                                void f2(int* a) {
void f(int* a, int j) {             for(i=0; i<N; i++)
    a[j] = a[j] + c;                    a[i] = a[i] + c;
}                               }

Figure 43: Example of loop pushing

Procedure Inlining 

This transformation replaces a call to a procedure by the code of the procedure itself. It is an inter-
procedural optimization. It allows a loop nest to be parallelized and removes the overhead caused by
the procedure call.

for(i=0; i<N; i++)              for(i=0; i<N; i++)
    f(a,i);                         a[i] = a[i] + c;

void f(int* x, int j) {
    x[j] = x[j] + c;
}

Figure 44: Example of procedure inlining

Statement Reordering

This transformation schedules instructions of the loop body to modify the data dependence graph and
enable vectorization.

for(i=0; i<N; i++) {            for(i=0; i<N; i++) {
    a[i] = b[i] * 2;                c[i] = a[i-1] - 4;
    c[i] = a[i-1] - 4;              a[i] = b[i] * 2;
}                               }

Figure 45: Example of statement reordering

Software Pipelining

This transformation parallelizes a loop body by scheduling instructions of different instances of the 
loop body. It is a powerful optimization to improve instruction-level parallelism. It can be used in 
conjunction with loop unrolling. In the example below, the preload commands can be issued one after 
another, each taking only one cycle. This time is just enough to request the memory areas. It is not 
enough to actually load them. This takes many cycles, depending on the cache level that actually has 
the data. Execution of a configuration behaves similarly. The configuration is issued in a single cycle, 
waiting until all data are present. Then the configuration executes for many cycles. 
Software pipelining overlaps the execution of a configuration with the preloads for the next 
configuration. This way, the XPP array can be kept busy in parallel to the Load/Store unit. 

Issue Cycle     Command

                XPPPreloadConfig(CFG1);
                for (i=0; i<100; ++i) {
    1:              XPPPreload(2,a+10*i,10);
    2:              XPPPreload(5,b+20*i,20);
    3:
    4:              // delay
    5:
    6:              XPPExecute(CFG1);
                }


Issue Cycle     Command

Prologue        XPPPreloadConfig(CFG1);
                XPPPreload(2,a,10);
                XPPPreload(5,b,20);
                // delay
                for (i=1; i<100; ++i) {
Kernel  1:          XPPExecute(CFG1);
        2:          XPPPreload(2,a+10*i,10);
        3:          XPPPreload(5,b+20*i,20);
        4:      }
Epilog          XPPExecute(CFG1);
                // delay

Figure 46: Example of software pipelining



Vector Statement Generation 

This transformation replaces instructions by vector instructions that can perform an operation on 
several data in parallel. 

for(i=0; i<=N; i++)             a[0:N] = b[0:N];
    a[i] = b[i];

Figure 47: Example of vector statement generation



3.2.3 Data-Layout Optimizations 

In the following we describe optimizations that modify the data layout in memory in order to extract 
more parallelism or prevent memory problems like cache misses.

Scalar Privatization

This optimization is used in multi-processor systems to increase the amount of parallelism and avoid
unnecessary communications between the processing elements. If a scalar is only used like a
temporary variable in a loop body, then each processing element can receive a copy of it and perform
its computations with this private copy.

for(i=0; i<=N; i++) {
    a[i] = a[i] + c;
}

Figure 48: Example for scalar privatization

Array Privatization 

This optimization is the same as scalar privatization except that it works on arrays rather than on
scalars.

Array Merging 

This optimization transforms the data layout of arrays by merging the data of several arrays following
the way they are accessed in a loop nest. This way, memory cache misses can be avoided. The layout
of the arrays can be different for each loop nest. Below is the example of a cross-filter, where the
accesses to array a are interleaved with accesses to array b. The picture next to it represents the data
layout of both arrays where blocks of a (in green) are merged with blocks of b (in yellow). Unused
memory space is in white. Thus cache misses are avoided as data blocks containing arrays a and b are
loaded into the cache when getting data from memory. More details can be found in [11].

for(i=1; i<=N-1; i++)
    for(j=1; j<=N; j++)
        b[i][j] = 0.25*(a[i-1][j] + a[i][j-1] +
                        a[i+1][j] + a[i][j+1]);

Figure 49: Example for array merging

3.2.4 Example of application of the optimizations 

As seen before a lot of optimizations can be performed on loops before and also after generation of 
vector statements. Finding a sequence of optimizations that would produce an optimal solution for all 
loop nests of a program is still an area of research. Therefore we can only propose a way to use these 
optimizations that follows a reasonable heuristic to produce vectorizable loop nests. To vectorize the 
code, we can use the Allen-Kennedy algorithm that uses statement reordering and loop distribution 
before vector statements are generated. It can be enhanced with loop interchange, scalar expansion, 
index set splitting, node splitting, loop peeling. All these transformations are based on the data 
dependence graph. A statement can be vectorized if it is not part of a dependence cycle, hence 
optimizations are performed to break cycles or, if not completely possible, to create loop nests without 
dependence cycles. 

We can divide the whole process in four major steps. First we should restructure the procedures by
analyzing the procedure calls inside the loop bodies and try to remove them. Then some high-level
dataflow optimizations are applied to the loop bodies to modify their control-flow and simplify their
code. The third step would consist in preparing the loop nests for vectorization by building perfect
loop nests and ensuring that inner loop levels are vectorizable. Then optimizations can be performed
that target the architecture and optimize the data locality. It should also be noted that other
optimizations and code transformations can occur between these different steps that can also help to
further optimize the loop nests.

Hence the first step applies procedure inlining and loop pushing to remove the procedure calls of the 
loop bodies. Then the second step consists of loop-invariant code motion, loop unswitching, strength 
reduction and idiom recognition. The third step can be divided in several subsets of optimizations. We 
can first apply loop reversal, loop normalization and if-conversion to get normalized loop nests. This 
allows the data dependence graph to be built. Then, if dependences prevent the loop nest from being
vectorized, transformations can be applied. For instance, if dependences occur only on certain
iterations, loop peeling or loop splitting can be applied. Node splitting, loop skewing, scalar
expansion or statement reordering can be applied in other cases. Then loop interchange moves inwards
the loop levels without
dependence cycles. The goal is to have perfectly nested loops with the loop levels carrying dependence 
cycles as much outwards as possible. Then we can apply loop fusion, reduction recognition, scalar 
replacement/array contraction and loop distribution to further improve the following vectorization. 
Vector statement generation can be performed at last using the Allen-Kennedy algorithm for instance. 
The last step can consist of optimizations like loop tiling, strip-mining, loop unrolling and software 
pipelining that take into account the target processor. 

The number of optimizations in the third step is large, but not all of them are applied to each loop
nest. Following the goal of the vectorization and the data dependence graph, only some of them are
applied. Heuristics are used to guide the application of the optimizations, which can be applied
several times if needed. Let us illustrate this with an example.

void f(int** a, int** b, int i, int j) {
    a[i][j] = a[i][j-1] - b[i+1][j-1];
}

void g(int* a, int* c, int i) {
    a[i] = c[i] + 2;
}

for(i=0; i<N; i++) {
    for(j=1; j<9; j++)
        if (k>0)
            f(a,b,i,j);
        else
            g(d,c,j);
    d[i] = d[i+1] + 2;
}
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;



The first step will find that inlining the two procedure calls is possible, then loop unswitching can be 
applied to remove the conditional instruction of the loop body. The second step begins by applying 
loop normalization and analyses the data dependence graph. A cycle can be broken by applying loop 
interchange as it is only carried by the second level. The two levels are exchanged, so that the inner 
level is vectorizable. Before that or also after, we apply loop distribution. Loop fusion can be applied
when the loop on i is pulled out of the conditional instruction by a traditional redundant code
elimination optimization. Finally vector code can be generated for the resulting loops.

In more detail, after procedure inlining, we obtain:

for(i=0; i<N; i++) {
    for(j=1; j<9; j++)
        if (k>0)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        else
            d[j] = c[j] + 2;
    d[i] = d[i+1] + 2;
}
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop unswitching, we obtain:

if (k > 0)
    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            a[i][j] = a[i][j-1] - b[i+1][j-1];
        d[i] = d[i+1] + 2;
    }
else
    for(i=0; i<N; i++) {
        for(j=1; j<9; j++)
            d[j] = c[j] + 2;
        d[i] = d[i+1] + 2;
    }
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop normalization, we obtain:

if (k > 0)
    for(i=0; i<N; i++) {
        for(j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
        d[i] = d[i+1] + 2;
    }
else
    for(i=0; i<N; i++) {
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
        d[i] = d[i+1] + 2;
    }
for(i=0; i<N; i++)
    a[i][i] = b[i] + 3;

After loop distribution and loop fusion, we obtain:

if (k > 0)
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
for(i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After loop interchange, we obtain:

if (k > 0)
    for(j=0; j<8; j++)
        for(i=0; i<N; i++)
            a[i][j+1] = a[i][j] - b[i+1][j];
else
    for(i=0; i<N; i++)
        for(j=0; j<8; j++)
            d[j] = c[j+1] + 2;
for(i=0; i<N; i++) {
    d[i] = d[i+1] + 2;
    a[i][i] = b[i] + 3;
}

After vector code generation, we obtain:

if (k > 0)
    for(j=0; j<8; j++)
        a[0:N-1][j+1] = a[0:N-1][j] - b[0:N][j];
else
    for(i=0; i<N; i++)
        d[0:8] = c[1:9] + 2;
d[0:N-1] = d[1:N] + 2;
a[0:N-1][0:N-1] = b[0:N] + 3;

4 Compiler Specification for the PACT XPP



4.1 Introduction 

A cached RISC-XPP architecture exploits its full potential on code that is characterized by high data
locality and high computational effort. A compiler for this architecture has to consider these design
constraints. The compiler's primary objective is to concentrate computationally expensive calculations
in innermost loops and to provide as much data locality as possible for them.

The compiler contains the usual analyses and optimizations. As interprocedural analyses, like alias
analysis, are especially useful, a global optimization driver is necessary to ensure the propagation of
global information to all optimizations. The following sections concentrate on the way the PACT XPP
influences the compiler.



4.2 Compiler Structure 



Figure 50 shows the main steps the compiler must follow to produce code for a system containing a
RISC processor and a PACT XPP. The next sections focus on the XPP compiler itself, but first the
other steps are briefly described.

[Flow diagram: Code Preparation -> Partitioning -> XPP Compiler -> RISC Code Gen. -> RISC Code Sched.]

Figure 50: Global View of the Compiling Process



4.2.1 Code Preparation 

This step takes the whole program as input and can be considered as a usual compiler front-end. It will
prepare the code by applying code analysis and optimizations to enable the compiler to extract as
many loop nests as possible to be executed by the PACT XPP. Important optimizations are idiom
recognition, copy propagation, dead code elimination, and all usual analyses like dataflow and alias
analysis.

4.2.2 Partitioning

Partitioning decides which part of the program is executed by the host processor and which part is 
executed by the PACT XPP. 

A loop nest is executed by the host in three cases: 

■ if the loop nest is not well-formed,

■ if the number of operations to execute does not make execution on the PACT XPP worthwhile, or

■ if it is impossible to get a mapping of the loop nest on the PACT XPP.

A loop nest is said to be well-formed if the loop bounds and the step of all loops are constant, the loop
induction variables are known and if there is only one entry and one exit to the loop nest.

Another problem arises with loop nests where the loop bounds are constant but unknown at compile
time. Loop tiling allows this problem to be overcome; it will be described below. Nevertheless it could
be that it is not worthwhile to execute the loop nest on the PACT XPP if the loop bounds are too low. A
conditional instruction testing if the loop bounds are large enough can be introduced, and 2 versions of
the loop nest are produced. One would be executed on the host processor, and the other on the PACT
XPP when the loop bounds are suitable. This would also ease applications of loop transformations, as
possible compensation code would be simpler due to the hypothesis on the loop bounds.
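
The two-version scheme can be sketched in C as follows. This is only an illustration under
assumptions: the threshold MIN_XPP_TRIP_COUNT, the type target_t and the functions loop_host,
loop_xpp and run_loop are hypothetical names, and loop_xpp merely stands in for code that would
trigger a configuration on the PACT XPP.

```c
#include <assert.h>
#include <stddef.h>

enum { MIN_XPP_TRIP_COUNT = 64 };   /* hypothetical profitability threshold */

typedef enum { RAN_ON_HOST, RAN_ON_XPP } target_t;

/* Version of the loop compiled for the host processor. */
static void loop_host(int *a, const int *b, int c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Stand-in for the version mapped on the array; here it only emulates
 * the result. It may assume n >= MIN_XPP_TRIP_COUNT. */
static void loop_xpp(int *a, const int *b, int c, size_t n)
{
    assert(n >= MIN_XPP_TRIP_COUNT);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Conditional instruction selecting one of the two versions at run time. */
target_t run_loop(int *a, const int *b, int c, size_t n)
{
    if (n >= MIN_XPP_TRIP_COUNT) {
        loop_xpp(a, b, c, n);
        return RAN_ON_XPP;
    }
    loop_host(a, b, c, n);
    return RAN_ON_HOST;
}
```

The assertion in loop_xpp reflects the hypothesis on the loop bounds that the array version is
allowed to rely on.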

4.2.3 RISC Code Generation and Scheduling 

After the XPP compiler has produced NML code for the loops chosen by the partitioning phase, the
main compiling process must handle the code that will be executed by the host processor, where
instructions to manage the configurations have been inserted. This is the aim of the last two steps:

■ RISC Code Generation and

■ RISC Code Scheduling.

The first one produces code for the host processor and the second one optimizes it further by looking
for a better scheduling, using software pipelining for instance.



4.3 XPP Compiler for Loops 



Figure 51 shows the internal processing of the XPP Compiler. It is a complex cooperation between
program transformations, included in the XPP Loop Optimizations, a temporal partitioning phase,
NML code generation and the mapping of the configuration on the PACT XPP.

[Flow diagram with the boxes XPP Loop Opt. and Temporal Partitioning, yes/no decision branches and
an exit]

Figure 51: Detailed Architecture of the XPP Compiler



First, loop optimizations targeted at the PACT XPP are applied to try to produce innermost loop bodies
that can be executed on the array of processors. If this is the case, NML code generation is
called; if not, temporal partitioning is applied to get several configurations for the same loop. After
NML code generation and the mapping phase, it can also happen that a configuration will not fit on the
PACT XPP. In this case the loop optimizations are applied again with respect to the reasons of failure
of the NML code generation or of the mapping. If this new application of loop optimizations does not
change the code, temporal partitioning is applied. Furthermore, we keep track of the number of
attempts for the NML code generation and the mapping. If too many attempts are made and we still
do not obtain a solution, we break the process, and the loop nest will be executed by the host
processor.
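
The control flow of this retry scheme can be summarized by the following C sketch. It is a toy model
under assumptions and not the actual compiler: every name (nest_t, ARRAY_CAPACITY, MAX_ATTEMPTS,
loop_opt, temporal_partition, nml_gen_and_map, compile_loop_nest) is hypothetical, a loop nest is
reduced to an operation count, loop optimization is modelled as shrinking that count a limited number
of times, and temporal partitioning as splitting each configuration in two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ARRAY_CAPACITY = 64, MAX_ATTEMPTS = 4 };   /* hypothetical limits */

typedef enum { ON_XPP, ON_HOST } placement_t;

typedef struct {
    int ops_per_config;   /* operations in each configuration */
    int configs;          /* number of configurations so far */
    int opt_budget;       /* remaining useful optimization passes (toy) */
} nest_t;

/* Toy loop optimization: returns true if it changed the code. */
static bool loop_opt(nest_t *n)
{
    if (n->opt_budget == 0)
        return false;
    n->opt_budget--;
    n->ops_per_config -= n->ops_per_config / 4;
    return true;
}

/* Toy temporal partitioning: split every configuration in two. */
static void temporal_partition(nest_t *n)
{
    n->ops_per_config = (n->ops_per_config + 1) / 2;
    n->configs *= 2;
}

/* Toy NML generation and mapping: succeeds if the configuration fits. */
static bool nml_gen_and_map(const nest_t *n)
{
    return n->ops_per_config <= ARRAY_CAPACITY;
}

placement_t compile_loop_nest(nest_t *n)
{
    assert(n != NULL && n->configs >= 1);
    loop_opt(n);
    if (n->ops_per_config > ARRAY_CAPACITY)   /* body not executable as is */
        temporal_partition(n);

    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        if (nml_gen_and_map(n))
            return ON_XPP;
        if (!loop_opt(n))                     /* optimizations changed nothing */
            temporal_partition(n);
    }
    return ON_HOST;                           /* too many attempts: give up */
}
```

The attempt counter is what guarantees termination: however the individual phases behave, the loop
nest ends up either mapped on the array or handed back to the host processor.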

4.3.1 Temporal Partitioning

Temporal partitioning splits the code generated for the PACT XPP in several configurations if the
number of operations, i.e. the size of the configuration, to be executed in a loop nest exceeds the
number of operations executable in a single configuration. This transformation is called loop
dissevering [6]. These configurations are then integrated in a loop of configurations whose number of
executions corresponds to the iteration range of the original loop.

4.3.2 Generation of NML Code

This step takes as input an intermediate form of the code produced by the XPP Loop Optimizations
step, together with a dataflow graph built upon it. NML code can then be produced by using tree- or
DAG-pattern matching techniques.




The mapping step takes care of mapping the NML modules on the PACT XPP by placing the operations
on the ALUs, FREGs, and BREGs, and routing the data through the buses.



The loop optimizations used for the PACT XPP are now described. Their goal is to extract as much
parallelism as possible from the loop nests in order to execute them on the PACT XPP by exploiting
the ALU-PAEs as effectively as possible and to avoid memory bottlenecks with the IRAMs. The
following sections explain how they are organized and how to take into account the architecture for
applying the optimizations.

4.4.1 Organization of the System

Figure 52 below presents the organization of the loop optimizations. The transformations are divided
in six groups. Other standard optimizations and analyses are applied in-between. Each group could be
called several times. Loops over several groups can also occur if needed. The number of iterations for
each driver loop can be of constant value or determined at compile time by the optimizations
themselves (e.g. repeat until a certain code quality is reached). In the first iteration of the loop, it
can be checked if loop nests are usable for the PACT XPP; this is mainly directed at checking the loop
bounds etc. For instance, if the loop nest is well-formed and the data dependence graph does not
prevent optimization, but the loop bounds are unknown, then in the first iteration loop tiling is
applied to get an innermost loop that is easier to handle and can be better optimized, and in the
second iteration, loop normalization, if-conversion, loop interchange and other optimizations can be
applied to effectively optimize the innermost loops.

Group III generates loop nests suitable for the data dependence analysis. Group V contains
optimizations that ensure that the innermost loops can be executed on the PACT XPP. Group VI contains
optimizations that further extract parallelism from the loop bodies. Group VII contains optimizations
more towards optimizing the usage of the hardware itself.

In each group the application of the optimizations depends on the result of the analysis and on the
data dependence graph computed before.

Figure 52: Detailed View of the XPP Loop Optimizations



4.4.2 Loop Preparation



The optimizations of Groups I, II and III of the XPP compiler generate loop bodies without procedure
calls, conditional instructions and induction variables other than loop control variables. Thus loop
nests, where the innermost loops are suitable for execution on the PACT XPP, are obtained. The
iteration ranges are normalized to ease data dependence analysis and the application of other code
transformations.

The optimizations of Group IV are performed to obtain innermost loops suitable for vectorization with
respect to the data dependence graph. Nevertheless a difference with usual vectorization is that a
dependence cycle, which would normally prevent any vectorization of the code, does not prevent the
optimization of a loop nest for the PACT XPP. If a cycle is due to an anti-dependence, then it could be
that it won't prevent optimization of the code as stated in [7]. Furthermore dependence cycles will not
prevent vectorization for the PACT XPP when they consist only of a loop-carried true dependence on
the same expression. If cycles with distance k occur in the data dependence graph, then this can be
handled by holding k values in registers. This optimization is of the same class as cycle shrinking.

Nevertheless limitations due to the dependence graph exist. Loop nests cannot be handled if some
dependence distances are not constant, or unknown. If only a few dependences prevent the
optimization of the whole loop nest, this could be overcome by using the traditional vectorization
algorithm that sorts topologically the strongly connected components of the data dependence graph
(statement reordering), and then applies loop distribution. This way, loop nests, some of which can be
handled by the PACT XPP and some by the host processor, can be obtained.

Some hardware-specific parameters influence the application of the loop transformations. The number
of operations and memory accesses that a loop body performs is estimated at each step. These
parameters influence loop unrolling, strip-mining, loop tiling and also loop interchange.

The table below lists the parameters that influence the application of the optimizations. For each of
them two values are given: a starting value computed from the loop, and a goal value that the
parameter should reach or should not exceed after the application of the optimizations. Vector length
depicts the range of the innermost loops, i.e. the number of elements of an array accessed in the loop
body. I/O IRAMs, ALU, FREG, BREG stand for the number of IRAMs, ALUs, FREGs and BREGs,
respectively. Configuration cycles includes the number of cycles dedicated to the control. The
application of each optimization may

■ decrease a parameter's value (-),

■ increase a parameter's value (+),

■ not influence a parameter (id), or

■ adapt a parameter's value to fit into the goal size (make fit).

Furthermore, some resources must be kept for control in the configuration; this means that the
optimizations should not make the needs exceed more than 70-80% of each resource.
Parameter                 Goal                        Starting Value
Vector length             IRAM size (256 words)       Loop count
Reused data set size      Approx. cache size          Algorithm analysis/loop sizes
I/O IRAMs                 PACT size (16)              Algorithm inputs + outputs
ALU                       PACT size (< 64)            ALU opcode estimate
BREG                      PACT size (< 80)            BREG opcode estimate
FREG                      PACT size (< 80)            FREG opcode estimate
Data flow graph width     High                        Algorithm data flow graph
Data flow graph height    Small                       Algorithm data flow graph
Configuration cycles      < command line parameter    Algorithm analysis


Here are some additional notations used in the following descriptions. Let n be the total number of
processing elements available, r the width of the dataflow graph, in the maximum number of input
values in a cycle and out the maximum number of output values possible in a cycle. On the PACT
XPP, n is the number of ALUs, FREGs and BREGs available for a configuration, r is the number of
ALUs, FREGs and BREGs that can be started in parallel in the same pipeline stage, and in and out
amount to the number of available IRAMs. As IRAMs have 1 input port and 1 output port, the number
of IRAMs yields directly the number of input and output data.

The number of operations of a loop body is computed by adding all logic and arithmetic operations
occurring in the instructions. The number of input values is the number of operands of the instructions
regardless of address operations. The number of output values is the number of output operands of the
instructions regardless of address operations. To determine the number of parallel operations, input
and output values, the dataflow graph must be considered. The effects of each transformation on
the architectural parameters are now presented in detail.

Loop Interchange 

Loop interchange is applied when the innermost loop has too narrow an iteration range. In that case,
loop interchange yields an innermost loop with a more profitable iteration range. It can also be
influenced by the layout of the data in memory. It can be profitable for data locality to interchange two
loops to get a more practical way to access arrays in the cache and therefore prevent cache misses. It is
of course also influenced by data dependences as explained earlier.
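As an illustration of the interchange described above, the following is a minimal sketch in C. The array shape (ROWS, COLS) and the function names are assumptions chosen for the example; they do not come from the original text. Both functions compute the same sum, but the second has the long, stride-1 loop innermost.

```c
#define ROWS 4      /* hypothetical: short dimension */
#define COLS 256    /* hypothetical: long dimension, e.g. one IRAM worth of words */

/* Before interchange: the innermost loop has only ROWS iterations
   and walks the array with a stride of COLS elements. */
long sum_short_inner(int a[ROWS][COLS])
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* After interchange: the innermost loop has COLS iterations
   and accesses consecutive memory locations. */
long sum_long_inner(int a[ROWS][COLS])
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}
```

The interchange is legal here because the additions carry no dependence that fixes the iteration order; with other loop bodies the data dependences have to be checked first.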
Parameter                 Effect
Vector length             +
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    id
Configuration cycles





Loop Distribution 

Loop distribution is applied if a loop body is too big to fit on the PACT XPP. Its main effect is to 
reduce the processing elements needed by the configuration. Reducing the need for IRAMs can only 
be a side effect. 



Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 make fit
ALU                       make fit
BREG                      make fit
FREG                      make fit
Data flow graph width
Data flow graph height
Configuration cycles





Loop Collapsing 

Loop collapsing can be used to make the loop body use more memory resources. As several
dimensions are merged, the iteration range is increased and the memory needed is increased as well.
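A minimal sketch of loop collapsing in C follows. The dimensions NI and NJ and the function names are assumptions for illustration only. The two-dimensional nest with an inner range of NJ becomes a single loop with a range of NI * NJ over the same row-major storage.

```c
#define NI 8     /* hypothetical outer dimension */
#define NJ 32    /* hypothetical inner dimension */

/* Original nest: the inner iteration range is only NJ. */
void scale_nested(int *a, int k)
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i * NJ + j] *= k;
}

/* Collapsed loop: one loop of NI * NJ iterations. */
void scale_collapsed(int *a, int k)
{
    for (int t = 0; t < NI * NJ; t++)
        a[t] *= k;
}
```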
Parameter                 Effect
Vector length
Reused data set size      +
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +



Loop Tiling 

Loop tiling, as multi-dimensional strip-mining, is influenced by all parameters. It is especially useful
when the iteration space is by far too big to fit in the IRAM, or to guarantee maximum execution time
when the iteration space is unbounded (see Section 4.4.6). It can then make the loop body fit with
respect to the resources of the PACT XPP, namely the IRAM and cache line sizes. The size of the tiles
for strip-mining and loop tiling can be computed like this:

tile size = resources available for the loop body / resources necessary for the loop body

The resources available for the loop body are the whole resources of the PACT XPP for this
configuration. A tile size can be computed for the data and another one for the processing elements;
the final tile size is then the minimum of these two. For instance, when the amount of data
accessed is larger than the capacity of the cache, loop tiling can be applied like below.



for(i=0; i<=1048576; i++)
    <loop body>



for(i=0; i<=1048576; i+=CACHE_SIZE)
    for(j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
        for(k=0; k<IRAM_SIZE; k++)
            <tiled loop body>



Figure 53: Example of loop tiling for the PACT XPP 
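The tile size rule can be sketched in C as follows. The function names and the resource figures used in the usage note are hypothetical; the sketch only shows the arithmetic (one quotient for data, one for processing elements, then the minimum) and a tiled traversal whose last tile may be incomplete.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* tile size = resources available / resources necessary, computed
   separately for data and for processing elements; the final tile
   size is the minimum of the two. All arguments must be positive. */
int tile_size(int data_avail, int data_per_iter,
              int pe_avail, int pe_per_iter)
{
    int data_tile = data_avail / data_per_iter;
    int pe_tile = pe_avail / pe_per_iter;
    return min_int(data_tile, pe_tile);
}

/* Tiled traversal of n elements; the inner loop never runs for more
   than 'tile' iterations, and the last tile is clipped to n. */
long sum_tiled(const int *a, int n, int tile)
{
    long s = 0;
    for (int i = 0; i < n; i += tile) {
        int end = min_int(i + tile, n);
        for (int k = i; k < end; k++)
            s += a[k];
    }
    return s;
}
```

For example, with 256 words of data capacity, 1 word needed per iteration, 64 processing elements and 2 elements needed per iteration, `tile_size(256, 1, 64, 2)` yields 32.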
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Parameter 



Vector Jength 



Reused data 



set size 



VO IRAMfc 



ALU 
BREG 



FREG 



Data flow graph width 



Data flow graph height 
Configuration cycles 



make fit 



make fit 



id 



id 



Strip-Mining 

64 ALU-PAEs which i^bT^^tflSr^ ^? 5 ?* a *™ ble ™ ** ** PACT »P £ 




Loop Fusion 



several loop 
Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +



Scalar Replacement 

The amount of memory needed by the loop body should always fit in the IRAMs. Thanks to this 
optimization, some input or output data represented by array references, that should be stored in 
IRAMs, are replaced by scalars, that are either stored in FREGs or kept on buses. 
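A minimal sketch of scalar replacement in C, using a row-sum loop as a hypothetical example (the function names and the flat row-major layout are assumptions). In the first version the array element b[i] is read and written on every inner iteration; in the second it is replaced by a scalar that can live in a register, and the array is written once per row.

```c
/* Before: b[i] is an array reference inside the inner loop. */
void row_sums_array(int n, int m, const int *a, int *b)
{
    for (int i = 0; i < n; i++) {
        b[i] = 0;
        for (int j = 0; j < m; j++)
            b[i] += a[i * m + j];
    }
}

/* After scalar replacement: the running sum is kept in 'acc'. */
void row_sums_scalar(int n, int m, const int *a, int *b)
{
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int j = 0; j < m; j++)
            acc += a[i * m + j];
        b[i] = acc;
    }
}
```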



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id



Loop Unrolling 

Loop unrolling, loop collapsing, loop fusion and loop distribution are influenced by the number of
operations of the body of the loop nest and the number of data inputs and outputs of these operations,
as they modify the size of the loop body. The number of operations should always be smaller than n,
and the number of input and output data should always be smaller than in and out.
Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height
Configuration cycles      +



Unroll-and-Jam 

accesses^ in the inner ,oop. ^ l^Z^^^^S^T^ 
the number of operations of the new ^ ,oop must aso fit lie PACT ^ MorcOVer 




4.4.5 Optimizations Towards Hardware Improvements 

« options dea , 

input dataduplicatic^ (similarto s «™^ 

Shift Register Synthesis 

This optimization deals with array access , . 

several values of an array anTal* ^TSfaSi IZtSTl** T**** ° f a loo P When 
registers rather than accessing memory °Lh ZTl \1> °f 1° convenient » store them in 

g memory each time they are needed. As the same value must be stored in 
different registers depending on the number of iterations it is alive, a value shares several registers and
flows from one register to another at each iteration. It is similar to a vector register allocated to an array
access with the same value for each element. This optimization is performed directly on the dataflow
graph by inserting nodes representing registers when a value must be stored in a register. In the PACT
XPP, it amounts to storing it in a data register. A detailed explanation can be found in [1].

Shift register synthesis is mainly suitable for small to medium numbers of iterations where values are
alive. Since the pipeline length increases with each iteration for which the value has to be buffered, the
following method is better suited for medium to large distances between accesses in one input array.

Nevertheless this method works very well for image processing algorithms which mostly alter a pixel
by analyzing itself and its surrounding neighbors.
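The idea can be sketched in C with a hypothetical three-element sliding window (function names are assumptions). The direct version reads three array elements per iteration; the shift register version reads one new element per iteration and lets the older values flow from register to register.

```c
/* Direct version: three memory reads per output value.
   Produces n - 2 outputs for n >= 3 inputs. */
void window3_direct(const int *x, int n, int *y)
{
    for (int i = 0; i + 2 < n; i++)
        y[i] = x[i] + x[i + 1] + x[i + 2];
}

/* Shift register version: one memory read per output value.
   r0 and r1 hold the values that are still alive from earlier
   iterations and are shifted at the end of each iteration. */
void window3_shift(const int *x, int n, int *y)
{
    if (n < 3)
        return;
    int r0 = x[0];
    int r1 = x[1];
    for (int i = 0; i + 2 < n; i++) {
        int r2 = x[i + 2];
        y[i] = r0 + r1 + r2;
        r0 = r1;
        r1 = r2;
    }
}
```

A value stays in the register chain for as many iterations as it is alive, which is why the chain, and with it the pipeline, grows with the access distance.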



Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles      id



Input Data Duplication 

This optimization is orthogonal to shift register synthesis. If different elements of the same array are
needed concurrently, instead of storing the values in registers, the same values are copied into different
IRAMs. The advantage over shift register synthesis is the shorter pipeline length, and therefore the
increased parallelism, and the unrestricted applicability. On the other hand, the cache-IRAM
bottleneck can affect the performance of this solution, depending on the amounts of data to be moved.
Nevertheless we assume that cache-IRAM transfers are negligible compared to transfers in the rest of the
memory hierarchy.
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Parameter 


Effect 


Vpptiif Ton 




cvuusea uaiu b w size 






L id . 


ALU 


id 


BREG 


id 


FREG 


id 


Data flow graph width 


+ 


Data flow graph height 




Configuration cycles 


id 



Loop Pipelining 



This optimization consists in synchronizing operations by inserting delays in the dataflow 



Parameter 


Effect 


Vector length 


+ 


Reused data set size' 


id 


1/OIRAMs 


«* 


ALU 


id J 


BREG 


id 


FREG 


id 


Data flow graph width 


+ 


Data flow graph height 




Configuration cycles 


1 



Tree Balancing 
Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height
Configuration cycles


4.4.6 Limiting the Execution Time of a Configuration

The execution time of a configuration must be controlled. This is ensured in the compiler by strip-mining
and loop tiling, which take care that no more input data than the IRAMs' capacity comes into the
PACT XPP in a cycle. This way the iteration range of the innermost loop that is executed on the PACT
XPP is limited, and therefore its execution time. Moreover partitioning ensures that only loops whose
execution count can be computed at run time are going to be executed on the PACT XPP. This
condition is trivial for for-loops, but for while-loops, where the execution count cannot be determined
statically, a transformation like the one sketched below can be applied. As a result, the inner for-loop can be
handled by the PACT XPP.

while (ok) {            while (ok)
    <loop body>             for(i=0; i<100 && ok; i++) {
}                               <loop body>
                            }

Figure 54: Transformation of while-loops 
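The while-loop transformation can be tried out with the following sketch in C. The loop body (a Collatz step), the function names and the chunk length CHUNK are assumptions chosen only to have a loop whose trip count is not known statically; the arguments must be at least 1. Both functions perform the same number of steps, but the inner for-loop of the second never runs for more than CHUNK iterations.

```c
#define CHUNK 100   /* hypothetical upper bound for one inner loop run */

static unsigned next_value(unsigned n)
{
    return (n % 2u) ? 3u * n + 1u : n / 2u;
}

/* Original form: execution count unknown before the loop starts. */
int steps_while(unsigned n)
{
    int steps = 0;
    int ok = (n != 1u);
    while (ok) {
        n = next_value(n);
        steps++;
        ok = (n != 1u);
    }
    return steps;
}

/* Transformed form: the inner for-loop has a bounded iteration range. */
int steps_bounded(unsigned n)
{
    int steps = 0;
    int ok = (n != 1u);
    while (ok) {
        for (int i = 0; i < CHUNK && ok; i++) {
            n = next_value(n);
            steps++;
            ok = (n != 1u);
        }
    }
    return steps;
}
```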
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5 Case Studies 



5.1 3x3 Edge Detector 

5.1.1 Original Code

Source Code:

#include <stdio.h>

#define VERLEN 16
#define HORLEN 16

int main(void) {
    int v, h, inp;
    int p1[VERLEN][HORLEN];
    int p2[VERLEN][HORLEN];
    int htmp, vtmp, sum;

    for(v=0; v<VERLEN; v++)               // loop nest 1
        for(h=0; h<HORLEN; h++) {
            scanf("%d", &p1[v][h]);       // read input pixels to p1
            p2[v][h] = 0;                 // initialize p2
        }

    for(v=0; v<=VERLEN-3; v++) {          // loop nest 2
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (p1[v+2][h]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v][h+2]) +
                   2 * (p1[v+2][h+1] - p1[v][h+1]);
            if (htmp < 0)
                htmp = - htmp;

            vtmp = (p1[v][h+2]   - p1[v][h])   +
                   (p1[v+2][h+2] - p1[v+2][h]) +
                   2 * (p1[v+1][h+2] - p1[v+1][h]);
            if (vtmp < 0)
                vtmp = - vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            p2[v+1][h+1] = sum;
        }
    }

    for(v=0; v<VERLEN; v++)               // loop nest 3
        for(h=0; h<HORLEN; h++)
            printf("%d\n", p2[v][h]);     // print output pixels from p2

    return 0;
}
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5.1.2 Preliminary Transformations 

Interprocedural Optimizations

The first step normally invokes interprocedural transformations like function inlining and loop
pushing. Since no procedure calls are within the loop body, these transformations are not applied to
this example.

Partitioning 

The partitioning algorithm chooses which code runs on the RISC processor and which code runs on 
the XPP. Since we only consider inner loops to run on the XPP, the basic blocks are annotated with the 
loop nest depth. Thus basic blocks which are not in a loop are separated out Furthermore function 
calls within a loop body prevent a loop to be considered for running on the XPP. 

In our benchmark the loop nests 1 and 3 are marked as to run on the RISC host because of the function 
call. In the following sections they are not considered any further. 

At this compilation stage it is not predictable if the remaining loop nests can be
synthesized for the XPP. We just separated the ones which definitely cannot run on it; others may
follow, since running the code on the RISC CPU is always the fallback in our strategy.

Loop Analysis and Normalization 

The code above already contains normalized loops. Nevertheless it is more likely that human written code
would look like

for(v=1; v < VERLEN - 1; v++) {
    for(h=1; h < HORLEN - 1; h++) {
        htmp = (p1[v+1][h-1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v-1][h+1]) +
               2 * (p1[v+1][h] - p1[v-1][h]);
        if (htmp < 0)
            htmp = - htmp;

        vtmp = (p1[v-1][h+1] - p1[v-1][h-1]) +
               (p1[v+1][h+1] - p1[v+1][h-1]) +
               2 * (p1[v][h+1] - p1[v][h-1]);
        if (vtmp < 0)
            vtmp = - vtmp;

        sum = htmp + vtmp;
        if (sum > 255)
            sum = 255;
        p2[v][h] = sum;
    }
}

Although a human reader sees it at first sight, it is not obvious for the compiler that the loop is well
formed. Therefore the compiler attempts to normalize the loop.

If the original loop induction variable is called i with the increment value s and lower and upper loop
bounds l and u, respectively, then the normalized loop with the induction variable i' and the upper
bound u' (the lower bound l' is 0 by definition) is obtained as follows:
• The upper bound calculates to u' = (u - l)/s.
• All occurrences of i are replaced by l + i' * s.

Applied to the code above, the loop statement for(v=1; v < VERLEN - 1; v++) with the
lower bound vl = 1, the upper bound vu = 14 (< 15 means <= 14 in integer arithmetic) and the
increment vs = 1 transforms to

for(vn=0; vn <= (vu - vl)/vs; vn++)

or simplified

for(vn=0; vn <= 13; vn++)

The h-loop is transformed equally, yielding the original code.
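The normalization rule can be checked with a small sketch in C. The function names are assumptions; the sketch assumes a positive increment s and an inclusive upper bound u >= l. It compares the values visited by a loop from l to u with step s against the normalized loop from 0 to (u - l)/s, in which each occurrence of i is replaced by l + i' * s.

```c
/* Sum of the induction variable over the original loop
   for (i = l; i <= u; i += s). Assumes s > 0. */
long sum_original(int l, int u, int s)
{
    long total = 0;
    for (int i = l; i <= u; i += s)
        total += i;
    return total;
}

/* Same sum over the normalized loop: lower bound 0, increment 1,
   upper bound (u - l) / s, and i replaced by l + in * s. */
long sum_normalized(int l, int u, int s)
{
    int un = (u - l) / s;
    long total = 0;
    for (int in = 0; in <= un; in++)
        total += l + in * s;
    return total;
}
```

For the v-loop of the example (l = 1, u = 14, s = 1) the normalized upper bound is 13, as stated in the text.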

Idiom Recognition 

In the second step idiom recognition finds the abs() and min() structures in the loop body. Please note
that although the XPP has no abs opcode, it can easily be synthesized and should therefore be
produced to simplify the internal representation (otherwise if-conversion has to handle this case, which
increases the complexity).

Therefore the code after idiom recognition looks like this (abs() and min() are compiler known functions
which are directly mapped to XPP opcodes or predefined NML modules):

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
        htmp = (p1[v+2][h]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
        htmp = abs(htmp);

        vtmp = (p1[v][h+2]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
        vtmp = abs(vtmp);

        sum = min(htmp + vtmp, 255);
        p2[v+1][h+1] = sum;
    }
}



Dependency Analysis 

for(v=0; v<=16-3; v++) {
    for(h=0; h<=16-3; h++) {
S1      htmp = (p1[v+2][h]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
S2      htmp = abs(htmp);
S3      vtmp = (p1[v][h+2]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
S4      vtmp = abs(vtmp);
S5      sum = min(htmp + vtmp, 255);
S6      p2[v+1][h+1] = sum;
    }
}

Figure 55 The expression tree of the edge 3x3 inner loop body

1^Za^ 100? 1 7 M J Vm ^ § l WWcb Prevent pipeUne The loop independent 

scato dependency do not prevent pipeline veetorization since the transformation does not dismrb the 

titssr"" expression substitution 7 dea4 eKm!nation wai 



5.1.3 Pre Code Generation Transformations

Forward Expression Substitution / Dead Code Elimination 

Sni a ^rhL^5 i ht f 1P - T P l^P allows forward expression substitution 

along with dead code elimination to place the whole calculation into one statement. 

p2[v+1][h+1] = min(abs((p1[v+2][h]   - p1[v][h])   +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1]))
                 + abs((p1[v][h+2]   - p1[v][h])   +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h])), 255);

The scalar accesses then disappear completely. 
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Mapping to IRAMs 

The array accesses are mapped to IRAMs. At this stage the IRAM numbers are chosen arbitrarily; the
actual mapping to XPP IRAMs is done later.

Therefore we rename p1[v+x][h+y] and p2[v+x][h+y] to iramN[y] (e.g. p1[v+2][h] to iram2[0]). The
code then reads

iram3[1] = min(abs((iram2[0] - iram0[0]) +
                   (iram2[2] - iram0[2]) +
                   2 * (iram2[1] - iram0[1])) +
               abs((iram0[2] - iram0[0]) +
                   (iram2[2] - iram2[0]) +
                   2 * (iram1[2] - iram1[0])), 255);

Tree Balancing 

The visualized expression tree in Figure 55 shows another valuable optimization before matching the
tree. Since the depth of the tree determines the length of the synthesized pipeline, another
simplification can decrease this depth. In both of the main sub trees the operands of the commutative
add expressions can be interchanged to reduce the overall tree depth.




Figure 56 One of the sub trees before and after balancing. The numbers represent the annotated maximum tree 

depth from the node to its deepest child leaf node 

The resulting expression tree is shown in Figure 56. 
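The effect of balancing on the tree depth can be illustrated with a short sketch in C. It is not the compiler's algorithm; the function names are assumptions, and it only contrasts a left-leaning chain of additions with a pairwise (balanced) grouping of the same operands. Both groupings give the same integer sum, but the balanced tree is shallower, which corresponds to a shorter pipeline. All functions assume n >= 1.

```c
/* Depth of a chain ((v0 + v1) + v2) + ... with n operands. */
int depth_chain(int n)
{
    return n > 1 ? n - 1 : 0;
}

/* Depth of a tree that splits the operands into two halves. */
int depth_balanced(int n)
{
    if (n <= 1)
        return 0;
    int left = depth_balanced(n / 2);
    int right = depth_balanced(n - n / 2);
    return 1 + (left > right ? left : right);
}

/* Chain evaluation order. */
int sum_chain(const int *v, int n)
{
    int s = v[0];
    for (int i = 1; i < n; i++)
        s += v[i];
    return s;
}

/* Balanced evaluation order; legal because integer addition is
   commutative and associative. */
int sum_balanced(const int *v, int n)
{
    if (n == 1)
        return v[0];
    int half = n / 2;
    return sum_balanced(v, half) + sum_balanced(v + half, n - half);
}
```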



5.1.4 XPP Code Generation

Pipeline Synthesis 

As already stated the pipeline is synthesized by a dynamic programming tree matcher. In contrast to
sequential processors it does not generate instructions and register references but PAE opcodes and
port connections. The main calculation network is shown in Figure 57. The input data preparation
network is not shown in this figure. The case of synthesized shift registers is shown in Figure 58,
while the variant with duplicated input data simply consists of an IRAM for each input channel in
Figure 57.

Although this is straightforward, there remains the question of how to access the different offsets of the
vector register accesses. Although the RAM-PAEs are dual ported it is obvious that it is not possible to
read different addresses concurrently.

Since it is not efficient to synthesize a configuration which generates the different addresses
sequentially and demultiplexes the read operands into different branches of the data flow, other
arrangements have to be made.

The two possibilities to access input data presented in subsection 4.4.5 yield the following in RISC
pseudo code and XPP utilization. The pseudo code running on the RISC core looks like

XPPPreload(config)
for(v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v+1], 16)
    XPPPreload(2, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3))
}

for shift register synthesis and like

XPPPreload(config)
for(v=0; v<=16-3; v++) {
    XPPPreload(0, &p1[v], 16)
    XPPPreload(1, &p1[v], 16)
    XPPPreload(2, &p1[v], 16)
    XPPPreload(3, &p1[v+1], 16)
    XPPPreload(4, &p1[v+1], 16)
    XPPPreload(5, &p1[v+2], 16)
    XPPPreload(6, &p1[v+2], 16)
    XPPPreload(7, &p1[v+2], 16)
    XPPPreloadClean(3, &p2[v+1], 16)
    XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
               IRAM(4), IRAM(5), IRAM(6), IRAM(7))
}

for data duplication, respectively.
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Figure 57 The main calculation network of the edge3x3 configuration. The MULT-SORT..^ - , , 
abs 0 calculation s th% SORTa^Vm^ca^u^T ° ^ ** 
Figure 58 One input after shift register synthesis. The leftmost input contains p1[][h], the middle one
p1[][h+1] and the rightmost p1[][h+2], respectively.

The values for place & route and simulation are compared in the following table. Note that a common 
RISC DSP with two MAC units and hardware loop support needs about 4000 cycles for the same 
code. This comparison does not account for cache misses. Furthermore it is obvious, that the number 
of input values is very small in this example and the DSP calculation time is proportional to that 
number. The XPP performance on the other hand will improve with the number of input values. 
Therefore the XPP performance will be more impressive with bigger image sizes. 



Parameter                          Value (shift register synthesis)       Value (data duplication)
Vector length                      16                                     16
Reused data set size               256                                    256
I/O IRAMs                          3 I + 1 O = 4                          8 I + 1 O = 9
ALU                                27                                     21
BREG                               21 (1 defined + 20 route)              10 (1 defined + 9 route)
FREG                               22 (9 defined + 23 route)              19 (3 defined + 16 route)
Data flow graph width              14                                     14
Data flow graph height             3 (shift registers) + 8 (calculation)  8 (calculation)
Configuration cycles (simulated)
  configuration                    2262                                   2145
  preloads (1)                     14*3*4 = 168                           8*8*4 = 256
  cycles                           14*57 = 798                            14*52 = 728
  sum                              3228                                   3129



1 assuming 4 words/cycle burst transfer 
5.1.5 Enhancing Parallelism

After the synthesis the configuration calculating the inner loop utilizes 27 ALUs and 4 IRAMs for shift
register synthesis and 21 ALUs and 9 IRAMs for data duplication, respectively. Assuming an XPP64
core this leaves plenty of room for further optimizations. Nevertheless, since all optimizations
enhancing parallelism are performed before the synthesis takes place, it is crucial that they estimate the
needed resources and the benefit of the transformation very carefully. Furthermore they have to
account for both input preparation strategies to estimate correct values.

Loop Unrolling

Fully unrolling the inner loop would not lead to satisfying results, because the number of inputs and
outputs increases dramatically. That means data duplication would not be applicable and shift register
synthesis would exhaust most of the benefits of the parallelism by producing a very long pipeline for
each data flow graph. Although partial unrolling of the inner loop would be applicable, it does not
promise much benefit for the area penalty introduced.

Unrolling the outer loop is also not applicable since it produces a further configuration.
Nevertheless a related transformation could do a good job on this loop nest.

Unroll-and-Jam

The unroll-and-jam algorithm enhances parallelism and also improves IRAM usage. It brings pairs of
iterations together, ideally reusing IRAM outputs and calculation results. The algorithm partially
unrolls the outer loop and fuses the originated inner loops. Before the unroll-and-jam is performed the
so-called unroll-and-jam factor must be determined, which denominates the unrolling factor of the
outer loop. This is mainly influenced by the number of ALUs n_XPP (= 64 assuming XPP64) and calculates
to

$$ c_{\text{unroll-and-jam}} = \frac{n_{XPP}}{n_{\text{inner loop}}} = \frac{64}{27} = 2 \quad \text{(integer division)}. $$



Thus the source code would be transformed to:

for(v=0; v<=VERLEN-3; v+=2) {
    for(h=0; h<=HORLEN-3; h++) {
        p2[v+1][h+1] = min(abs((p1[v+2][h]   - p1[v][h])   +
                               (p1[v+2][h+2] - p1[v][h+2]) +
                               2 * (p1[v+2][h+1] - p1[v][h+1])) +
                           abs((p1[v][h+2]   - p1[v][h])   +
                               (p1[v+2][h+2] - p1[v+2][h]) +
                               2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        p2[v+2][h+1] = min(abs((p1[v+3][h]   - p1[v+1][h])   +
                               (p1[v+3][h+2] - p1[v+1][h+2]) +
                               2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                           abs((p1[v+1][h+2] - p1[v+1][h])   +
                               (p1[v+3][h+2] - p1[v+3][h])   +
                               2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
    }
}

The transformation introduces additional accesses to p1[v+3][h], p1[v+3][h+2], p1[v+3][h+1], and
p1[v+1][h+1] (the former hole in the access pattern) as well as a write access to p2[v+2][h+1]. That
means 2 IRAMs more for shift register synthesis (one input, one output) and 5 IRAMs more for data
duplication (4 input, 1 output), while performance is doubled.
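That unroll-and-jam with factor 2 leaves the result unchanged can be checked with the following sketch in C. The helper names (edge_pixel, edge_plain, edge_unroll_jam2, unroll_and_jam_factor) are assumptions introduced for the example; the arithmetic is that of the edge detector, with VERLEN = HORLEN = 16 as in the original code, so that the outer loop count (14) is a multiple of 2 and no peeling is needed.

```c
#include <stdlib.h>

#define VERLEN 16
#define HORLEN 16

static int min_int(int a, int b) { return a < b ? a : b; }

/* One output pixel for the 3x3 window with top-left corner (v, h). */
static int edge_pixel(int p1[VERLEN][HORLEN], int v, int h)
{
    int htmp = (p1[v+2][h]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v][h+2]) +
               2 * (p1[v+2][h+1] - p1[v][h+1]);
    int vtmp = (p1[v][h+2]   - p1[v][h])   +
               (p1[v+2][h+2] - p1[v+2][h]) +
               2 * (p1[v+1][h+2] - p1[v+1][h]);
    return min_int(abs(htmp) + abs(vtmp), 255);
}

/* Unroll-and-jam factor: ALUs available / ALUs of one inner loop body. */
int unroll_and_jam_factor(int n_xpp, int n_inner_loop)
{
    return n_xpp / n_inner_loop;
}

/* Original loop nest. */
void edge_plain(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v++)
        for (int h = 0; h <= HORLEN - 3; h++)
            p2[v+1][h+1] = edge_pixel(p1, v, h);
}

/* Outer loop unrolled by 2, the two inner loops fused into one. */
void edge_unroll_jam2(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    for (int v = 0; v <= VERLEN - 3; v += 2)
        for (int h = 0; h <= HORLEN - 3; h++) {
            p2[v+1][h+1] = edge_pixel(p1, v, h);
            p2[v+2][h+1] = edge_pixel(p1, v + 1, h);
        }
}
```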
Parameter                          Value (shift register synthesis)       Value (data duplication - no IRAM placement)
Vector length                      16                                     16
Reused data set size               256                                    256
I/O IRAMs                          4 I + 2 O = 6                          12 I + 2 O = 14
ALU                                45                                     37
BREG                               31 (12 defined + 19 route)             42 (4 defined + 38 route)
FREG                               29 (1 defined + 28 route)              18 (1 defined + 17 route)
Data flow graph width                                                     14
Data flow graph height             3 (shift registers) + 8 (calculation)  8 (calculation)
Configuration cycles (simulated)
  configuration                    2753                                   2754
  preloads                         7*4*4 = 112                            7*12*4 = 336
  cycles                           7*53 = 371                             7*69 = 483
  sum                              3236                                   3573



Parameter                          Value (data duplication - with IRAM placement)
Vector length                      16
Reused data set size               256
I/O IRAMs                          12 I + 2 O = 14
ALU                                37
BREG                               36 (4 defined + 32 route)
FREG                               24 (1 defined + 23 route)
Data flow graph width              14
Data flow graph height             3 (shift registers) + 8 (calculation)
Configuration cycles (simulated)
  configuration                    2768
  preloads                         7*12*4 = 336
  cycles                           7*51 = 357
  sum                              3461






The simulated results are shown in the table above. Please note the differences between the two columns
labeled "data duplication". The first used xmap to place the IRAMs, while in the second the
IRAMs were placed by hand using a greedy algorithm which places IRAMs that are operands of the
same operator in one line (as long as this is possible). The second solution improved the iteration
cycles by 18. This shows that IRAM placement has a great impact on the final performance.



The traditional unroll-and-jam algorithm uses loop peeling to split the outer loop into a preloop and an
unrollable main loop to handle odd loop counts. When we assume for instance n_XPP = 128, the unroll-and-jam
factor would calculate to

$$ c_{\text{unroll-and-jam}} = \frac{128}{27} = 4 \quad \text{(integer division)}. $$
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Since the outer loop count (14) is not a multiple of 4, the algorithm virtually peels off the first two 

for(v=0; v<=VERLEN-5; v+=4) {
    for(h=0; h<=HORLEN-3; h++) {
        p2[v+1][h+1] = min(abs((p1[v+2][h]   - p1[v][h])   +
                               (p1[v+2][h+2] - p1[v][h+2]) +
                               2 * (p1[v+2][h+1] - p1[v][h+1])) +
                           abs((p1[v][h+2]   - p1[v][h])   +
                               (p1[v+2][h+2] - p1[v+2][h]) +
                               2 * (p1[v+1][h+2] - p1[v+1][h])), 255);
        p2[v+2][h+1] = min(abs((p1[v+3][h]   - p1[v+1][h])   +
                               (p1[v+3][h+2] - p1[v+1][h+2]) +
                               2 * (p1[v+3][h+1] - p1[v+1][h+1])) +
                           abs((p1[v+1][h+2] - p1[v+1][h])   +
                               (p1[v+3][h+2] - p1[v+3][h])   +
                               2 * (p1[v+2][h+2] - p1[v+2][h])), 255);
        p2[v+3][h+1] = min(abs((p1[v+4][h]   - p1[v+2][h])   +
                               (p1[v+4][h+2] - p1[v+2][h+2]) +
                               2 * (p1[v+4][h+1] - p1[v+2][h+1])) +
                           abs((p1[v+2][h+2] - p1[v+2][h])   +
                               (p1[v+4][h+2] - p1[v+4][h])   +
                               2 * (p1[v+3][h+2] - p1[v+3][h])), 255);
        p2[v+4][h+1] = min(abs((p1[v+5][h]   - p1[v+3][h])   +
                               (p1[v+5][h+2] - p1[v+3][h+2]) +
                               2 * (p1[v+5][h+1] - p1[v+3][h+1])) +
                           abs((p1[v+3][h+2] - p1[v+3][h])   +
                               (p1[v+5][h+2] - p1[v+5][h])   +
                               2 * (p1[v+4][h+2] - p1[v+4][h])), 255);
    }
}



5.1.6 Parameterized Function 

Source code 

^^uS^ct^^a £5* ^ tteQ " f ° m ™ «■ Worid application 
with theses of Ae pSo w^k on flmCtI ° n f ° r in P Ut «* ^ys along 

Therefore the source code would look similar to: 

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
    int v, h, htmp, vtmp, sum;

    for(v=0; v<=VERLEN-3; v++) {
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (*(p1 + (v+2) * HORLEN + h)   - *(p1 + v * HORLEN + h))   +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
                   2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
            if (htmp < 0)
                htmp = - htmp;

            vtmp = (*(p1 + v * HORLEN + h+2)     - *(p1 + v * HORLEN + h))     +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
                   2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
            if (vtmp < 0)
                vtmp = - vtmp;

            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            *(p2 + (v+1) * HORLEN + h+1) = sum;
        }
    }
}

This requires some additional features from the compiler:

• interprocedural optimizations and analysis
• hints by the programmer (e.g. a compiler known assert(VERLEN % 2 == 0) makes unroll-and-jam
actually possible without peeling off iterations and running them conditionally)

Fitting the Algorithm Optimally to the Array 

Since HORLEN and VERLEN are not known at compile time, these unknown parameters introduce
some constraints which prevent pipeline vectorization. The compiler must assume that the IRAMs
cannot hold all HORLEN input values in a row, so pipeline vectorization would not be possible.

Strip Mining Inner Loop 

Strip mining partitions the inner loop into a loop that runs over a strip, which is chosen to be of the
same size as the IRAMs can hold, and a by-strip loop iterating over the strips. Of course the strip loop's
upper bound must be adjusted for the possibly incomplete last strip. After the strip mining the original
code would look like (outer v-loop neglected):

for(h=0; h <= HORLEN-3; h += stripsize) {
    for(hh=h; hh <= min(h+stripsize-1, HORLEN-3); hh++) {
        htmp = (*(p1 + (v+2) * HORLEN + hh) - *(p1 + v * HORLEN + hh)) +
        ...
    }
}



Assuming a strip size equal to the IRAM size of 256, the following simulated results can be obtained for one strip.
The values must be multiplied by the number of strips to be calculated.
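Strip mining with an adjusted bound for the last strip can be sketched in C as follows. The loop body (doubling each element) and the function names are assumptions standing in for the edge detector body; the structure of the by-strip loop and the strip loop follows the description above. The second function also returns the number of strips, which is the factor the per-strip cycle counts have to be multiplied by.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop over n elements. */
void double_plain(int *a, int n)
{
    for (int h = 0; h < n; h++)
        a[h] *= 2;
}

/* Strip-mined loop: the by-strip loop advances in steps of stripsize,
   the strip loop handles at most stripsize elements, and its upper
   bound is clipped for a possibly incomplete last strip.
   Requires stripsize > 0. Returns the number of strips processed. */
int double_strips(int *a, int n, int stripsize)
{
    int strips = 0;
    for (int h = 0; h < n; h += stripsize) {
        for (int hh = h; hh <= min_int(h + stripsize - 1, n - 1); hh++)
            a[hh] *= 2;
        strips++;
    }
    return strips;
}
```

With 600 elements and a strip size of 256, for instance, the loop runs over two full strips and one incomplete strip of 88 elements.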



Parameter                          Value (shift register synthesis)       Value (data duplication - with IRAM placement)
Vector length                      16                                     16
Reused data set size               256                                    256
I/O IRAMs                          4 I + 2 O = 6                          12 I + 2 O = 14
ALU                                45                                     37
BREG                               31 (12 defined + 19 route)             42 (4 defined + 38 route)
FREG                               29 (1 defined + 28 route)              18 (1 defined + 17 route)
Data flow graph width              14                                     14
Data flow graph height             3 (shift registers) + 8 (calculation)  8 (calculation)
Configuration cycles (simulated)
  configuration                    2753                                   2754
  preloads                         7*4*64 = 1792                          7*12*64 = 5376
  cycles                           128*530 = 67840                        128*553 = 70784
  sum                              72385                                  78914



The RISC DSP needs about 1.47 million cycles for this amount of data. As mentioned above these
values do not include cache miss penalties and therefore underestimate the real values. Furthermore it can
be seen that data duplication does not improve the performance. The reason for this seems to be a
worse placement and routing.
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5.2 FIR Filter 

5.2.1 Original Code 



Source code: 

#define N 256
#define M 8

for (i = 0; i < N-M+1; i++) {
S:    y[i] = 0;
      for (j = 0; j < M; j++)
S':       y[i] += c[j] * x[i+M-j-1];
}
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hi!SSSt^ "* repIa06d ^ ^ ValU6S by f**"—- T*» *** dependence graph 




for (i = 0; i < 269; i++) {
S:    y[i] = 0;
      for (j = 0; j < 8; j++)
S':       y[i] += c[j] * x[i+7-j];
}

We have the following table: 
Parameter                 Value
Vector length             269
Reused data set size
I/O IRAMs                 3
ALU                       2
BREG                      0
FREG
Data flow graph width     1
Data flow graph height    2
Configuration cycles      2+8=10



5.2.2 First Solution 

If we want to save memory, the straightforward solution is to unroll the inner loop and to use
shift register synthesis to delay the values of array x in the pipeline. No other optimization is applied
beforehand, as either they do not have an effect on the loop or they increase the need for IRAMs. After loop
unrolling, we obtain the following code:

for (i = 0; i < 269; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}

Then the table looks like this: 



Parameter                   Value
Vector length               269
Reused data set size        -
I/O IRAMs                   9
ALU                         16
BREG                        0
FREG                        0
Data flow graph width       2
Data flow graph height      9
Configuration cycles        9+269=278

Dataflow analysis reveals that y[0] = f(x[0],...,x[7]), y[1] = f(x[1],...,x[8]), ...,
y[i] = f(x[i],...,x[i+7]). Successive values of y depend on almost the same successive values of x.
To prevent unnecessary accesses to the IRAMs, the values of x needed for the computation of the next
values of y are kept in registers. In our case this shift register synthesis needs 7 registers. This
is achieved on the PACT XPP by keeping them in FREGs. We then obtain the dataflow graph depicted
below. One IRAM is used for the input values and one IRAM for the output values. The first 8 cycles
are used to fill the pipeline, and then the throughput is one output value per cycle. We can depict
the code as follows:

r0 = x[0];
r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];

for (i = 0; i < 269; i++) {
    y[i] = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+8];    /* next input value enters the shift register */
}
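The effect of shift register synthesis can be checked on a conventional processor. The following is a minimal sketch, not the XPP configuration itself: the function names fir_direct and fir_shift_reg, the macro TAPS and the parameter n (number of outputs) are assumptions made for illustration, and the array r[] merely stands in for the FREG delay chain.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 8  /* M in the text */

/* Direct form: y[i] = sum over j of c[j] * x[i+TAPS-1-j].
   x must hold n+TAPS-1 values. */
void fir_direct(const int *c, const int *x, int *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t j = 0; j < TAPS; j++)
            acc += c[j] * x[i + TAPS - 1 - j];
        y[i] = acc;
    }
}

/* Shift-register form: every x value is read from memory exactly once. */
void fir_shift_reg(const int *c, const int *x, int *y, size_t n)
{
    int r[TAPS];
    assert(n > 0);
    for (size_t k = 0; k < TAPS; k++)           /* fill the pipeline */
        r[k] = x[k];
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t k = 0; k < TAPS; k++)
            acc += c[TAPS - 1 - k] * r[k];      /* r[k] holds x[i+k] */
        y[i] = acc;
        for (size_t k = 0; k + 1 < TAPS; k++)   /* shift the window by one */
            r[k] = r[k + 1];
        if (i + 1 < n)
            r[TAPS - 1] = x[i + TAPS];          /* single new load per output */
    }
}
```

Both functions should produce identical output vectors; the second one trades the eight loads per output for one load plus seven register moves, which is the saving the FREG chain is meant to provide.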




The final table is shown below, and the expected speedup with respect to a standard superscalar
processor with 2 instructions issued per cycle is 13.6.

Parameter                   Value
Vector length               269
Reused data set size        -
I/O IRAMs                   2
ALU                         16
BREG                        0
FREG                        7
Data flow graph width       3
Data flow graph height      9
Configuration cycles        8+269=277

Ops                                     Number
LD/ST (2 cycles)                        2
ADDRCOMP (1 cycle)                      0
ADD/SUB (1 cycle)                       8
MUL (2 cycles)                          8
SHIFT (1 cycle)                         0
Cycles per iteration                    28
Cycles needed for the loop (2-way)      (28*269)/2=3766
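The arithmetic behind the two tables can be written down explicitly. The sketch below is only an illustration of how the figures combine, assuming the per-operation latencies given in the column headings and an ideal 2-way issue; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Operation counts of one loop iteration on the reference RISC processor. */
struct op_counts {
    int ld_st;     /* 2 cycles each */
    int addrcomp;  /* 1 cycle each  */
    int add_sub;   /* 1 cycle each  */
    int mul;       /* 2 cycles each */
    int shift;     /* 1 cycle each  */
};

int cycles_per_iteration(struct op_counts ops)
{
    return 2 * ops.ld_st + ops.addrcomp + ops.add_sub + 2 * ops.mul + ops.shift;
}

/* Ideal estimate for an issue_width-way superscalar processor. */
int loop_cycles(struct op_counts ops, int iterations, int issue_width)
{
    assert(issue_width > 0);
    return cycles_per_iteration(ops) * iterations / issue_width;
}

/* Ratio of RISC cycles to XPP configuration cycles. */
double speedup(int risc_cycles, int xpp_cycles)
{
    return (double)risc_cycles / (double)xpp_cycles;
}
```

With 2 LD/ST, 8 ADD/SUB and 8 MUL operations this gives 28 cycles per iteration, 3766 cycles for 269 iterations on a 2-way machine, and a ratio of about 13.6 against the 277 configuration cycles.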



Variant with Larger Loop Bounds 

Let us take larger loop bounds and set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

Following the loop optimizations driver given before, we apply loop tiling to reduce the iteration
range of the inner loop. We obtain the following loop nest:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += c[8*jj+j] * x[i+63-8*jj-j];
}
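That tiling only re-indexes the inner loop can be checked directly. The following sketch computes one output value with the original and with the tiled inner loop; the function names and the macros TAPS and TILE are assumptions introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 64  /* M in the text */
#define TILE 8   /* tile size of the inner loop */

/* One output value with the original inner loop. */
int fir_tap_sum(const int *c, const int *x, size_t i)
{
    int acc = 0;
    for (size_t j = 0; j < TAPS; j++)
        acc += c[j] * x[i + TAPS - 1 - j];
    return acc;
}

/* The same value with the inner loop tiled into TAPS/TILE blocks of TILE taps. */
int fir_tap_sum_tiled(const int *c, const int *x, size_t i)
{
    int acc = 0;
    assert(TAPS % TILE == 0);
    for (size_t jj = 0; jj < TAPS / TILE; jj++)
        for (size_t j = 0; j < TILE; j++)
            acc += c[TILE * jj + j] * x[i + TAPS - 1 - TILE * jj - j];
    return acc;
}
```

The pair (jj, j) enumerates exactly the indices 8*jj+j = 0..63 in the original order, so both functions return the same sum for every i.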

A subsequent application of loop unrolling on the inner loop yields:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        y[i] += c[8*jj]   * x[i+63-8*jj];
        y[i] += c[8*jj+1] * x[i+62-8*jj];
        y[i] += c[8*jj+2] * x[i+61-8*jj];
        y[i] += c[8*jj+3] * x[i+60-8*jj];
        y[i] += c[8*jj+4] * x[i+59-8*jj];
        y[i] += c[8*jj+5] * x[i+58-8*jj];
        y[i] += c[8*jj+6] * x[i+57-8*jj];
        y[i] += c[8*jj+7] * x[i+56-8*jj];
    }
}

Finally we obtain the same dataflow graph as above, except that the coefficients must be read from
another IRAM rather than being directly handled like constants by the multiplications. After shift
register synthesis the code is the following:

for (i = 0; i < 961; i++) {
    r0 = x[i+56];
    r1 = x[i+57];
    r2 = x[i+58];
    r3 = x[i+59];
    r4 = x[i+60];
    r5 = x[i+61];
    r6 = x[i+62];
    r7 = x[i+63];
    for (jj = 0; jj < 8; jj++) {
        y[i] = c[8*jj]*r0 + c[8*jj+1]*r1 + c[8*jj+2]*r2 + c[8*jj+3]*r3 +
               c[8*jj+4]*r4 + c[8*jj+5]*r5 + c[8*jj+6]*r6 + c[8*jj+7]*r7;
        r0 = r1;
        r1 = r2;
        r2 = r3;
        r3 = r4;
        r4 = r5;
        r5 = r6;
        r6 = r7;
        r7 = x[i+63-8*jj];
    }
}



The table is the same as before except for the vector length, and the expected speedup with respect
to a standard superscalar processor with 2 instructions issued per cycle is 17.5.

Parameter                   Value
Vector length               8
Reused data set size        -
I/O IRAMs                   2
ALU                         16
BREG                        0
FREG                        7
Data flow graph width       3
Data flow graph height      9
Configuration cycles        8+8=16

Ops                                     Number
LD/ST (2 cycles)                        10
ADDRCOMP (1 cycle)                      6
ADD/SUB (1 cycle)                       10
MUL (2 cycles)                          17
SHIFT (1 cycle)                         0
Cycles per iteration                    70
Cycles needed for the loop (2-way)      (70*8)/2=280



5.2.3 A More Parallel Solution 

The solution we presented does not expose a lot of parallelism in the loop. We can try to explicitly
parallelize the loop before we generate the dataflow graph. Of course, exposing more parallelism
means more pressure on the memory hierarchy.

In the data dependence graph presented at the beginning, the only loop-carried dependence is the
dependence on S', and it is only caused by the reference to y[i]. Hence we apply node splitting to
get a more suitable data dependence graph. We then obtain:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++) {
        tmp = c[j] * x[i+7-j];
        y[i] += tmp;
    }
}

Then scalar expansion is performed on tmp to remove the loop-carried anti-dependence caused by it,
and we have the following code:

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++) {
        tmp[j] = c[j] * x[i+7-j];
        y[i] += tmp[j];
    }
}

The parameter table is the following:

Parameter                   Value
Vector length               249
Reused data set size        -
I/O IRAMs
ALU                         2
BREG                        0
FREG                        1
Data flow graph width       2
Data flow graph height      2
Configuration cycles        2+8=10



Then we apply loop distribution to get a vectorizable and a non-vectorizable loop.

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];
    for (j = 0; j < 8; j++)
        y[i] += tmp[j];
}



The parameter table given below corresponds to the two inner loops in order to be compared with the 
preceding table. 



Parameter                   Value
Vector length               249
Reused data set size        -
I/O IRAMs                   5
ALU                         2
BREG                        0
FREG                        1
Data flow graph width       1
Data flow graph height      3
Configuration cycles        1*8+1*8=16



Then we must take the architecture into account. The first loop is fully parallel; this means that
we would need 2*8=16 input values at a time. This is all right, as it corresponds to the number of
IRAMs of the PACT XPP. Hence we do not need to strip-mine the first inner loop. The case of the
second loop is trivial; it does not need to be strip-mined either. The second loop is a reduction:
it computes the sum of a vector. This is easily found by the reduction recognition optimization,
and we obtain the following code.

for (i = 0; i < 249; i++) {
    y[i] = 0;
    for (j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i+7-j];

    /* load the partial sums from memory using a shorter vector length */
    for (j = 0; j < 4; j++)
        aux[j] = tmp[2*j] + tmp[2*j+1];

    /* accumulate the short vector */
    for (j = 0; j < 1; j++)
        aux[2*j] = aux[2*j] + aux[2*j+1];

    /* sequence of scalar instructions to add up the partial sums */
    y[i] = aux[0] + aux[2];
}
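The general idea behind reduction recognition, replacing a serial accumulation by levels of independent pairwise additions, can be sketched as follows. This is a generic tree reduction written for illustration, not the code emitted by the compiler; the function name tree_reduce is hypothetical, and the vector length is assumed to be a power of two.

```c
#include <assert.h>

/* Sums v[0..n-1] by pairwise (tree) reduction; n must be a power of two.
   The array is overwritten with the partial sums of each level. */
int tree_reduce(int *v, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (int width = n / 2; width >= 1; width /= 2)
        for (int j = 0; j < width; j++)     /* additions of one level are independent */
            v[j] = v[2 * j] + v[2 * j + 1];
    return v[0];
}
```

For eight products this needs three levels with 4, 2 and 1 additions; within a level all additions can run in parallel, which is what makes the reduction attractive for a dataflow array.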

As above, we give only one table for all innermost loops and the last instruction computing y[i].

Parameter                   Value
Vector length               249
Reused data set size        -
I/O IRAMs                   12
ALU                         4
BREG                        0
FREG                        0
Data flow graph width       1
Data flow graph height      4
Configuration cycles        1*8+1*4+1*1=13



Finally loop unrolling is applied on the inner loops; the number of operations is always less than
the number of processing elements of the PACT XPP.

for (i = 0; i < 961; i++) {
    tmp[0] = c[0] * x[i+7];
    tmp[1] = c[1] * x[i+6];
    tmp[2] = c[2] * x[i+5];
    tmp[3] = c[3] * x[i+4];
    tmp[4] = c[4] * x[i+3];
    tmp[5] = c[5] * x[i+2];
    tmp[6] = c[6] * x[i+1];
    tmp[7] = c[7] * x[i];

    aux[0] = tmp[0] + tmp[1];
    aux[1] = tmp[2] + tmp[3];
    aux[2] = tmp[4] + tmp[5];
    aux[3] = tmp[6] + tmp[7];

    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];

    y[i] = aux[0] + aux[2];
}

We then obtain the following dataflow graph representing the inner loop.




[...] synchronized, the throughput reaches one iteration/cycle after 4 cycles to fill the pipeline.
The coefficients are taken as constant inputs by the ALUs [...] the coefficients are handled like a
constant for each ALU. But due to data locality [...] reside in the cache. And as the data transfer
from the cache to the IRAMs is achieved efficiently, the configuration can be executed on the PACT
XPP without waiting for the data to be ready in the IRAMs. The parameter table is then the
following:



Parameter                   Value
Vector length               249
Reused data set size        -
I/O IRAMs                   16
ALU                         15
BREG                        0
FREG                        0
Data flow graph width       8
Data flow graph height      4
Configuration cycles        4+961



Variant with Larger Bounds 



To make things a bit more interesting, we set the values of N and M to 1024 and 64.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

After node splitting:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++) {
        tmp = c[j] * x[i+63-j];
        y[i] += tmp;
    }
}

After scalar expansion:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++) {
        tmp[j] = c[j] * x[i+63-j];
        y[i] += tmp[j];
    }
}

After loop distribution:

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        tmp[j] = c[j] * x[i+63-j];
    for (j = 0; j < 64; j++)
        y[i] += tmp[j];
}

We go through the compiling process, and we arrive at the set of optimizations that depend upon
architectural parameters. We want to split the iteration space, as too many operations would have
to be performed in parallel if we kept it as such. Hence we perform strip-mining on the 2 loops. We
can only access 16 data at a time, so, because of the first loop, the factor will be 64 * 2/16 = 8
for the 2 loops (as we always have in mind that we want to execute both at the same time on the
PACT XPP).

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    for (jj = 0; jj < 8; jj++)
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
}

And then loop fusion on the jj loops is performed.

for (i = 0; i < 961; i++) {
    y[i] = 0;
    for (jj = 0; jj < 8; jj++) {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
        for (j = 0; j < 8; j++)
            y[i] += tmp[8*jj+j];
    }
}

Now we apply reduction recognition on the second innermost loop.

for (i = 0; i < 961; i++) {
    tmp = 0;
    for (jj = 0; jj < 8; jj++) {
        for (j = 0; j < 8; j++)
            tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];

        /* load the partial sums from memory using a shorter vector length */
        for (j = 0; j < 4; j++)
            aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];

        /* accumulate the short vector */
        for (j = 0; j < 1; j++)
            aux[2*j] = aux[2*j] + aux[2*j+1];
    }
}

And then loop unrolling.

for (i = 0; i < 961; i++) {
    for (jj = 0; jj < 8; jj++) {
        tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
        tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
        tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
        tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
        tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
        tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
        tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
        tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];

        aux[0] = tmp[8*jj]   + tmp[8*jj+1];
        aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
        aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
        aux[3] = tmp[8*jj+6] + tmp[8*jj+7];

        aux[0] = aux[0] + aux[1];
        aux[2] = aux[2] + aux[3];

        y[i] = aux[0] + aux[2];
    }
}
We implement the innermost loop on the PACT XPP [...] to memory, as it is a global array and its
address is constant [...] counter of the configuration. The final parameter table is the following:

Parameter                   Value
Vector length               8
Reused data set size        -
I/O IRAMs                   16
ALU                         15
BREG                        0
FREG                        0
Data flow graph width       8
Data flow graph height      4
Configuration cycles        4+8=12



Nevertheless it should be noted that this version should be less efficient than the previous one. As the 
same data must be loaded in the different IRAMs from the cache, we have a lot of transfers to achieve 
before the configuration can begin the computations. This overhead must be taken into account by the 
compiler when choosing the code generation strategy. This means also that the first solution is the 
solution that will be chosen by the compiler. 

5.2.4 Other Variant 

Source Code 

for (i = 0; i < N-M+1; i++) {
    tmp = 0;
    for (j = 0; j < M; j++)
        tmp += c[j] * x[i+M-j-1];
    x[i] = tmp;
}

In this case, it is trivial that the data dependence graph is cyclic due to dependences on tmp.
Therefore scalar expansion is applied on the loop, and we obtain in fact the same code as the first
version of the FIR filter, as shown below.

for (i = 0; i < N-M+1; i++) {
    tmp[i] = 0;
    for (j = 0; j < M; j++)
        tmp[i] += c[j] * x[i+M-j-1];
    x[i] = tmp[i];
}






5.3 Matrix Multiplication 



5.3.1 Original Code 

Source code:

#include <stdio.h>

#define L 10
#define M 15
#define N 20

int A[L][M];
int B[M][N];
int R[L][N];

main() {
    int i, j, k, tmp, aux;

    /* input A (L*M values) */
    for (i=0; i<L; i++)
        for (j=0; j<M; j++)
            scanf("%d", &A[i][j]);

    /* input B (M*N values) */
    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            scanf("%d", &B[i][j]);

    /* multiply */
    for (i=0; i<L; i++)
        for (j=0; j<N; j++) {
            aux = 0;
            for (k=0; k<M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }

    /* write data stream */
    for (i=0; i<L; i++)
        for (j=0; j<N; j++)
            printf("%d\n", R[i][j]);
}



5.3.2 Preliminary Transformations 

Since no inline-able function calls are present, no interprocedural code movement is done. 

Of the four loop nests, the one with the "multiply" comment is the only candidate for running on
the XPP. All others have function calls in the loop body and are therefore discarded as candidates
very early in the compiler.

Dependency Analysis 

for (i=0; i<L; i++)
    for (j=0; j<N; j++) {
S1:     aux = 0;
        for (k=0; k<M; k++)
S2:         aux += A[i][k] * B[k][j];
S3:     R[i][j] = aux;
    }

Figure 59 Data dependency graph for matrix multiplication

The data dependency graph shows no dependencies that prevent pipeline vectorization. The
loop-carried true dependence from S2 to itself can be handled by a feedback of aux as described in [1].

Reverse Loop-Invariant Code Motion 

To get a perfect loop nest we move S1 and S3 inside the k-loop. Therefore appropriate guards are
generated to protect the assignments. The code after this transformation looks like

for (i=0; i<L; i++)
    for (j=0; j<N; j++)
        for (k=0; k<M; k++) {
            if (k == 0) aux = 0;
            aux += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux;
        }

Scalar Expansion

Our goal is to interchange the loop nests to improve the array accesses and make the best use of
the cache. Unfortunately the guarded statements involving aux cause backward loop-carried
anti-dependences carried by the j loop. Scalar expansion will break these dependences, allowing
loop interchange.

for (i=0; i<L; i++)
    for (j=0; j<N; j++)
        for (k=0; k<M; k++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }



Loop Interchange for Cache Reuse 

Visualizing the main loop shows the iteration spaces for the array accesses (Figure 60). Since C
arrays are placed in row-major order, the cache lines are placed in the array rows. At first sight
there seems to be no need for optimization, because the algorithm requires at least one array
access to stride over a column. Nevertheless this assumption misses the fact that the access rate
is of interest, too. Closer examination shows that array R is accessed in every j iteration, while
B is accessed in every k iteration, always producing a cache miss². This leaves a possibility for
loop interchange to improve cache access as proposed by Kennedy and Allen in [7].




Figure 60 The visualized array access sequences. 

Finding the best loop nest is relatively simple. The algorithm simply interchanges each loop of the
nest into the innermost position and annotates it with the so-called innermost memory cost term.
This cost term is a constant for known loop bounds or a function of the loop bound for unknown loop
bounds. The term is calculated in three steps.

■ First the cost of each reference³ in the innermost loop body is calculated to
    - 1, if the reference does not depend on the loop induction variable of the (current) innermost
      loop
    - the loop count, if the reference depends on the loop induction variable and strides over a
      non-contiguous area with respect to the cache layout
    - N*s/b, if the reference depends on the loop induction variable and strides over a contiguous
      area, where N is the loop count, s is the step size and b is the cache line size,
      respectively.
■ Second each reference cost is weighted with a factor for each other loop, which is
    - 1, if the reference does not depend on the loop index
    - the loop count, if the reference depends on the loop index.
■ Third the overall loop nest cost is calculated by summing the costs of all references.

After invoking this algorithm for each loop as the innermost, the one with the lowest cost is
chosen as the innermost, the next as the next outermost, and so on.
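The three steps above can be sketched in C for the three references of the matrix multiplication. This is only an illustration under stated assumptions: the function names are hypothetical, the step size s is taken as 1, and the cache line size b is expressed in array elements.

```c
#include <assert.h>

/* Trip count of loop variable v: i runs to L, k to M, j to N. */
static double trip(char v, double L, double M, double N)
{
    return v == 'i' ? L : (v == 'k' ? M : N);
}

/* Cost of one reference X[row][col] when loop `inner` is innermost. */
static double ref_cost(char row, char col, char inner,
                       double L, double M, double N, double b)
{
    double cost;
    if (inner == col)
        cost = trip(col, L, M, N) / b;   /* contiguous: N*s/b with s = 1 */
    else if (inner == row)
        cost = trip(row, L, M, N);       /* strides over a column */
    else
        cost = 1.0;                      /* invariant in the innermost loop */
    if (row != inner) cost *= trip(row, L, M, N);   /* weights of the other loops */
    if (col != inner) cost *= trip(col, L, M, N);
    return cost;
}

/* Overall cost of the nest with `inner` ('i', 'j' or 'k') as innermost loop. */
double nest_cost(char inner, double L, double M, double N, double b)
{
    assert(b >= 1.0);
    return ref_cost('i', 'j', inner, L, M, N, b)    /* R[i][j] */
         + ref_cost('i', 'k', inner, L, M, N, b)    /* A[i][k] */
         + ref_cost('k', 'j', inner, L, M, N, b);   /* B[k][j] */
}
```

For L = 10, M = 15, N = 20 and an assumed line size of b = 4 elements, the costs are 275 for j, 537.5 for k and 650 for i as the innermost loop, so j is ranked innermost, then k, then i.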



[...] Since the transformation wants to optimize cache access, it [...]

Innermost loop    R[i][j]      A[i][k]      B[k][j]      Memory access cost
k                 1*L*N        (M/b)*L      M*N          L*N + (M/b)*L + M*N
i                 L*N          L*M          1*M*N        L*N + L*M + M*N
j                 (N/b)*L      1*L*M        (N/b)*M      (N/b)*(L+M) + L*M

Table 1 Loop memory access costs for the different loops being innermost

The table shows the values for the matrix multiplication. Since the j term is the smallest (of
course assuming b > 1), the j-loop is chosen to be the innermost. The next outer loop then is k,
and the outermost is i. Thus the resulting code after loop interchange is

for (i=0; i<L; i++)
    for (k=0; k<M; k++)
        for (j=0; j<N; j++) {
            if (k == 0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k == M-1) R[i][j] = aux[j];
        }




Figure 61 The visualized array access sequences after optimization. Here the improvement is visible
to the naked eye, since array B is now read following the cache lines.

Figure 61 shows the improved iteration spaces. Note that this optimization does not optimize
primarily for the XPP, but mainly optimizes the cache-hit rate, thus improving the overall
performance.

Unroll and Jam

After improving the cache access behavior, the possibility for reduction recognition has been
destroyed. This is a typical example of transformations where one excludes the other. Nevertheless
we obtain more parallelism by doing unroll-and-jam. Therefore we unroll the outer loop partially
with the unroll factor. This factor is mainly chosen as the minimum of two calculations:

■ # available IRAMs / # used IRAMs in the inner loop body
■ # available ALU resources / # used ALU resources in the inner loop

In this example the accesses to "A" and "B" depend on k (the loop which will be unrolled).
Therefore they must be considered in the calculation. The accesses to "aux" and "R" do not depend
on k. Thus they can be subtracted from the available IRAMs, but do not need to be added to the
denominator. Therefore we calculate (assuming an XPP64) 14/2 = 7 for the unroll factor obtained
from the IRAM resources.

On the other hand the loop body involves two ALU operations (1 add, 1 mult), which yields an
unrolling factor of approximately 64/2 = 32*. The constraint generated by the IRAMs therefore
dominates by far.
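The factor selection described above amounts to a minimum of two integer quotients. The sketch below is a simplified illustration with hypothetical function names; it also shows how many leading iterations would have to be peeled so that the remaining trip count is a multiple of the factor.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unroll factor limited by the free IRAMs and the free ALUs. */
int unroll_factor(int free_irams, int irams_per_body,
                  int free_alus, int alus_per_body)
{
    assert(irams_per_body > 0 && alus_per_body > 0);
    return min_int(free_irams / irams_per_body, free_alus / alus_per_body);
}

/* Number of leading iterations to peel so that the rest is a multiple of factor. */
int peel_count(int trip_count, int factor)
{
    assert(factor > 0);
    return trip_count % factor;
}
```

With 14 free IRAMs, 2 IRAMs per loop body, 64 ALUs and 2 ALU operations per body the factor is min(7, 32) = 7, and a trip count of 15 leaves one iteration to peel.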
Having chosen the unroll factor, we must trim our loop trip count to be a multiple of that factor.
Since the k loop has a loop count of 15, we peel off the first iteration and unroll the remaining
loop.

for (i=0; i<L; i++) {
    for (k=0; k<1; k++) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k==M-1) R[i][j] = aux[j];
        }
    }
    for (k=1; k<M; k+=7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
            if (k==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+1==0) aux[j] = 0;
            aux[j] += A[i][k+1] * B[k+1][j];
            if (k+1==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+2==0) aux[j] = 0;
            aux[j] += A[i][k+2] * B[k+2][j];
            if (k+2==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+3==0) aux[j] = 0;
            aux[j] += A[i][k+3] * B[k+3][j];
            if (k+3==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+4==0) aux[j] = 0;
            aux[j] += A[i][k+4] * B[k+4][j];
            if (k+4==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+5==0) aux[j] = 0;
            aux[j] += A[i][k+5] * B[k+5][j];
            if (k+5==M-1) R[i][j] = aux[j];
        }
        for (j=0; j<N; j++) {
            if (k+6==0) aux[j] = 0;
            aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}

* This is a very inaccurate estimation, since it neither estimates the resources spent by the
controlling network, which decreases the unroll factor, nor takes into account that e.g. the
BREG-PAEs also have an adder, which increases the unroll factor. Although it has no influence on
this example, the unroll factor calculation of course has to account for this in a production
compiler.


Because the reverse loop-invariant code motion placed the loop-invariant code into the inner loop,
which is now duplicated seven times, it is very likely that dead code elimination can get rid of
some of these duplicates. Thus the code is shortened to

for (i=0; i<L; i++) {
    for (k=0; k<1; k++) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            aux[j] += A[i][k] * B[k][j];
        }
    }
    for (k=1; k<M; k+=7) {
        for (j=0; j<N; j++) {
            aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+1] * B[k+1][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+2] * B[k+2][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+3] * B[k+3][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+4] * B[k+4][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+5] * B[k+5][j];
        }
        for (j=0; j<N; j++) {
            aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}

Before we jam the inner loops we have to account for the fact that the first iteration of the k
loop was peeled off, which would produce a configuration of its own. Since we calculated the
unroll-and-jam factor to fit into one configuration, this side effect has to be prevented. Because
it should be no problem to run the k loop with variable step sizes, we fuse the k loops again,
adjust the step size and guard the statements. This yields

for (i=0; i<L; i++) {
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            if (k==0) aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k] * B[k][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+1] * B[k+1][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+2] * B[k+2][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+3] * B[k+3][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+4] * B[k+4][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+5] * B[k+5][j];
        }
        for (j=0; j<N; j++) {
            if (k>0) aux[j] += A[i][k+6] * B[k+6][j];
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}

Now we can jam the inner loops and finally obtain

for (i=0; i<L; i++) {
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        for (j=0; j<N; j++) {
            if (k==0) aux[j] = 0;
            if (k==0) aux[j] += A[i][k] * B[k][j];
            if (k>0) {
                aux[j] += A[i][k]   * B[k][j];
                aux[j] += A[i][k+1] * B[k+1][j];
                aux[j] += A[i][k+2] * B[k+2][j];
                aux[j] += A[i][k+3] * B[k+3][j];
                aux[j] += A[i][k+4] * B[k+4][j];
                aux[j] += A[i][k+5] * B[k+5][j];
                aux[j] += A[i][k+6] * B[k+6][j];
            }
            if (k+6==M-1) R[i][j] = aux[j];
        }
    }
}
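Whether the peeled, unrolled and jammed loop nest still computes the same product can be checked against the original triple loop. The following is a minimal sketch for that purpose only: the function names and the macro U are assumptions, the seven jammed statements are written as a short loop over u for brevity, and the array sizes are the ones from the source code above.

```c
#include <assert.h>

#define L 10
#define M 15
#define N 20
#define U 7   /* unroll factor */

/* Reference: the original triple loop. */
void matmul_ref(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Interchanged loop nest with one peeled k iteration and U jammed copies. */
void matmul_unroll_jam(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    assert((M - 1) % U == 0);   /* one peeled iteration, then full groups of U */
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k += (k < 1 ? 1 : U))
            for (int j = 0; j < N; j++) {
                if (k == 0) {
                    aux[j] = A[i][0] * B[0][j];       /* peeled first iteration */
                } else {
                    for (int u = 0; u < U; u++)       /* jammed copies of the body */
                        aux[j] += A[i][k + u] * B[k + u][j];
                    if (k + U - 1 == M - 1)
                        R[i][j] = aux[j];
                }
            }
}
```

The k loop visits 0, 1 and 8; the guard on k > 0 matters, since without it the first pass would also add the products for k+1 to k+6 and count them twice.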

5.3.3 XPP Code Generation

The innermost loop can be synthesized in a configuration which uses 14 IRAMs for the input data,
one IRAM to temporarily store aux and one IRAM for the output array R. Furthermore it is necessary
to pass the value of k to the XPP to direct the dataflow. This may be done by a streaming input.
Figure 62 shows the dataflow graph of the synthesized configuration.

Figure 62 Dataflow graph of matrix multiplication after unroll and jam. The rightmost 3 branches
are omitted. Event connections are emphasized in red color.

The following code shows the pseudo code executed on the RISC processor.

XPPPreload(config)
for (i=0; i<L; i++) {
    XPPPreload(0, &A[i][0], M)
    XPPPreload(1, &A[i][0], M)
    XPPPreload(2, &A[i][0], M)
    XPPPreload(3, &A[i][0], M)
    XPPPreload(4, &A[i][0], M)
    XPPPreload(5, &A[i][0], M)
    XPPPreload(6, &A[i][0], M)
    XPPPreloadClean(15, &R[i][0], M)
    for (k=0; k<M; k+= k<1 ? 1 : 7) {
        XPPPreload(7, &B[k][0], N)
        XPPPreload(8, &B[k+1][0], N)
        XPPPreload(9, &B[k+2][0], N)
        XPPPreload(10, &B[k+3][0], N)
        XPPPreload(11, &B[k+4][0], N)
        XPPPreload(12, &B[k+5][0], N)
        XPPPreload(13, &B[k+6][0], N)
        XPPExecute(config, IRAM(0), IRAM(1), IRAM(2), IRAM(3),
                   IRAM(4), IRAM(5), IRAM(6), IRAM(7), ...

[...] values promise improvements of 200-300 percent compared to a standalone RISC core.




BREG                                26 (8 defined + 18 route)
FREG                                28 (4 defined + 24 route)
Data flow graph width               14
Data flow graph height              6 (without routing and balancing)
Configuration cycles (simulated)
    configuration                   2633
    preloads                        10*3*7*5 = 1050
                                    10*7*15 = 1050
    cycles                          ((k==0) 112 + (k==1) 100 + (k==7) 100) * 10 = 3120
    sum                             7853

5.4 Viterbi Encoder



5.4.1 Original Code 

Source Code:

/* C-language butterfly */
#define BFLY(i) {\
    unsigned char metric, m0, m1, decision;\
    metric = ((Branchtab29_1[i] ^ sym1) + \
              (Branchtab29_2[i] ^ sym2) + 1)/2;\
    m0 = vp->old_metrics[i] + metric;\
    m1 = vp->old_metrics[i+128] + (15 - metric);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i)&31);\
    m0 -= (metric+metric-15);\
    m1 += (metric+metric-15);\
    decision = (m0-m1) >= 0;\
    vp->new_metrics[2*i+1] = decision ? m1 : m0;\
    vp->dp->w[i/16] |= decision << ((2*i+1)&31);\
}

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    for (i=0; i<8; i++)
        vp->dp->w[i] = 0;

    for (i=0; i<128; i++)
        BFLY(i);

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        for (i=0; i<64; i++)
            if (vp->new_metrics[i] < minmetric)
                minmetric = vp->new_metrics[i];
        for (i=0; i<64; i++)
            vp->new_metrics[i] -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}

5.4.2 Interprocedural Optimizations and Scalar Transformations

Since no inline-able function calls are present, no interprocedural code movement is done. 

After expression simplification, strength reduction, SSA renaming, copy coalescing and idiom
recognition, the code looks like the following (statements reordered for convenience).
Note that idiom recognition will find the combination of min() and the use of the comparison result
for decision and _decision. However the resulting computation cannot be expressed in C, so we
describe it as a comment:

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
    int i;
    struct v29 *vp = p;
    unsigned char *tmp;
    int normalize = 0;

    char *_vpdpw_ = vp->dp->w;
    for (i=0; i<8; i++)
        *_vpdpw_++ = 0;

    char *_bt29_1 = Branchtab29_1;
    char *_bt29_2 = Branchtab29_2;
    char *_vpom0 = vp->old_metrics;
    char *_vpom128 = vp->old_metrics+128;
    char *_vpnm = vp->new_metrics;
    char *_vpdpw = vp->dp->w;

    for (i=0; i<128; i++) {
        unsigned char metric, _tmp, m0, m1, _m0, _m1, decision, _decision;

        metric = ((*_bt29_1++ ^ sym1) +
                  (*_bt29_2++ ^ sym2) + 1)/2;
        _tmp = (metric+metric-15);
        m0 = *_vpom0++ + metric;
        m1 = *_vpom128++ + (15 - metric);
        _m0 = m0 - _tmp;
        _m1 = m1 + _tmp;
        // decision = m0 >= m1;
        // _decision = _m0 >= _m1;
        *_vpnm++ = min(m0,m1);      // = decision ? m1 : m0
        *_vpnm++ = min(_m0,_m1);    // = _decision ? _m1 : _m0
        _vpdpw[i >> 4] |= (m0 >= m1) /* decision */ << ((2*i) & 31)
                        | (_m0 >= _m1) /* _decision */ << ((2*i+1) & 31);
    }

    /* Renormalize metrics */
    if (vp->new_metrics[0] > 150) {
        int i;
        unsigned char minmetric = 255;

        char *_vpnm = vp->new_metrics;
        for (i=0; i<64; i++)
            minmetric = min(minmetric, *_vpnm++);

        _vpnm = vp->new_metrics;
        for (i=0; i<64; i++)
            *_vpnm++ -= minmetric;
        normalize = minmetric;
    }

    vp->dp++;
    tmp = vp->old_metrics;
    vp->old_metrics = vp->new_metrics;
    vp->new_metrics = tmp;

    return normalize;
}



5.4.3 Initialization

The first loop (setting vp->dp->w[0..7] to zero) is most efficiently executed on the RISC.

5.4.4 Butterfly Loop

The second loop (with the BFLY() macro expanded) is of interest for the XPP compiler and needs
further examination:

char *iram0 = Branchtab29_1;           // XPPPreload(0, Branchtab29_1, 128/4);
char *iram2 = Branchtab29_2;           // XPPPreload(2, Branchtab29_2, 128/4);
char *iram4 = vp->old_metrics;         // XPPPreload(4, vp->old_metrics, 128/4);
char *iram5 = vp->old_metrics+128;     // XPPPreload(5, vp->old_metrics+128, 128/4);
short *iram6 = vp->new_metrics;        // XPPPreload(6, vp->new_metrics, 128/2);
unsigned long *iram7 = vp->dp->w;      // XPPPreload(7, vp->dp->w, 8);
// sym1 & sym2 are in IRAM 1 & 3

for (i=0; i<128; i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1;

    metric = ((*iram0++ ^ sym1) +
              (*iram2++ ^ sym2) + 1)/2;
    _tmp = (metric << 1) - 15;
    m0 = *iram4++ + metric;
    m1 = *iram5++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;

    // assuming big endian; little endian has the shift on the latter min()
    *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
    iram7[i >> 4] |= (m0 >= m1) << ((2*i) & 31)
                   | (_m0 >= _m1) << ((2*i+1) & 31);
}

The data flow graph is as follows (for now ignoring the fact that the IRAM accesses are mostly char
accesses). The solid lines represent data flow, while the dashed lines represent event flow:

Parameter                   Value
Vector length               128
Reused data set size        -
I/O IRAMs                   [illegible]
ALU                         [illegible]
BREG                        [illegible]
FREG                        [illegible]
Data flow graph width       4
Data flow graph height      11
Configuration cycles        11+128



We immediately see some problems: IRAM7 is fully busy reading and rewriting the same address
sixteen times. Loop tiling to a tile size of sixteen gives redundant load store elimination a chance
to read the value once and accumulate the bits in a temporary, writing the value to the IRAM at the
end of this inner loop. Loop fusion with the initialization loop then allows propagation of the zero
values set in the first loop to the reads of vp->dp->w[i] (IRAM7), eliminating the first loop altogether.
Loop tiling with a tile size of 16 also eliminates the & 31 expressions for the shift values: since the
new inner loop only runs from 0 to 16, the value range analysis now finds that the & 31 expression is
not limiting the value range any further.
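The effect of tiling, redundant load store elimination and fusion with the initialization loop can be sketched in plain C. This is only an illustration under stated assumptions: the decision bits are passed in as two hypothetical arrays d0 and d1 (standing for the results of m0 >= m1 and _m0 >= _m1), and the function names are not taken from the study.

```c
#include <assert.h>
#include <stdint.h>

/* Untiled form: every iteration reads and rewrites w[i >> 4]. */
static void decisions_untiled(const uint8_t *d0, const uint8_t *d1, uint32_t w[8])
{
    for (int i = 0; i < 8; i++)
        w[i] = 0;                               /* separate initialization loop */
    for (int i = 0; i < 128; i++)
        w[i >> 4] |= (uint32_t)(d0[i] & 1) << ((2 * i) & 31)
                   | (uint32_t)(d1[i] & 1) << ((2 * i + 1) & 31);
}

/* Tiled form: 8 tiles of 16 iterations, bits accumulate in a temporary. */
static void decisions_tiled(const uint8_t *d0, const uint8_t *d1, uint32_t w[8])
{
    for (int t = 0; t < 8; t++) {
        uint32_t acc = 0;                       /* fused initialization */
        for (int j = 0; j < 16; j++) {
            int i = 16 * t + j;
            /* 2*j+1 is at most 31, so the "& 31" mask is no longer needed */
            acc |= (uint32_t)(d0[i] & 1) << (2 * j)
                 | (uint32_t)(d1[i] & 1) << (2 * j + 1);
        }
        w[t] = acc;                             /* one store per tile */
    }
}
```

Both functions fill w[0..7] with the same words; the tiled form touches each word exactly once.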

All remaining input IRAMs are character (8 bit) based. So we need split networks to split the 32-bit
stream into four 8-bit streams which are then merged. This adds 3 shifts, 3 ands and 3 merges for
every character IRAM. The merges could be eliminated when unrolling the loop body. However,
unrolling is limited to unrolling twice due to ALU availability as well as due to the fact that IRAM6 is
already 16 bit based: unrolling once requires a shift by 16 and an or to write 32 bits in every cycle;
unrolling further cannot increase pipeline throughput any more. So the body is only unrolled once,
eliminating one layer of merges. This yields two separate pipelines that each handle two eight bit
slices of the 32-bit value from the IRAM, serialized by merges.
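A split network of this kind can be pictured with a few lines of C. The sketch below is an assumption-laden illustration, not the NML network itself: the function name split_word is hypothetical, and the lane numbering (lane 0 is the least significant byte) is chosen freely.

```c
#include <assert.h>
#include <stdint.h>

/* Split one 32-bit IRAM word into four 8-bit values
 * using 3 shifts and 3 ANDs, as counted in the text. */
static void split_word(uint32_t w, uint8_t out[4])
{
    out[0] = (uint8_t)(w & 0xFFu);            /* AND only       */
    out[1] = (uint8_t)((w >> 8) & 0xFFu);     /* SHIFT + AND    */
    out[2] = (uint8_t)((w >> 16) & 0xFFu);    /* SHIFT + AND    */
    out[3] = (uint8_t)(w >> 24);              /* SHIFT only     */
}
```

The three merges mentioned above would then serialize the four lanes again; unrolling the loop body replaces a merge layer by parallel copies of the pipeline.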

The modified code now looks like (unrolling and splitting omitted for simplicity): 



char *iram0= Branchtab29_1;          // XPPPreload(0, Branchtab29_1, 128/4);
char *iram2= Branchtab29_2;          // XPPPreload(2, Branchtab29_2, 128/4);
char *iram4= vp->old_metrics;        // XPPPreload(4, vp->old_metrics, 128/4);
char *iram5= vp->old_metrics+128;    // XPPPreload(5, vp->old_metrics+128, 128/4);
short *iram6= vp->new_metrics;       // XPPPreload(6, vp->new_metrics, 128/2);
unsigned long *iram7= vp->dp->w;     // XPPPreload(7, vp->dp->w, 8);
// sym1 & sym2 are in IRAM 1 & 3

for(_i=0;_i<8;_i++) {
  rlse= 0;
  for(i2=0;i2<32;i2+=2) {
    unsigned char metric,_tmp,m0,m1,_m0,_m1;

    metric = ((*iram0++ ^ sym1) +
              (*iram1++ ^ sym2) + 1)/2;
    _tmp= (metric << 1) - 15;
    m0 = *iram2++ + metric;
    m1 = *iram3++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;

    *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
    rlse = rlse | (m0 >= m1) << i2
                | (_m0 >= _m1) << (i2+1);
  }
  *iram7++ = rlse;
}



The modified data flow graph (unrolling and splitting omitted for simplicity): 



[Figure: modified data flow graph. Inputs: Btab29_1 (iram0), sym1 (iram1), Btab29_2 (iram2), sym2 (iram3); counter range [0..8); output: new_metrics.]
And here is the splitting network for one IRAM: the bottom-most level of merges is omitted for each level of
unrolling.




Parameter                  Value
Vector length              128
Reused data set size       -
I/O IRAMs                  6I+2O
ALU                        2*24+4*3(split)+2(join) = 62
BREG                       few
FREG                       few
Data flow graph width      4
Data flow graph height     11+3(split)
Configuration cycles       14+64

5.4.5 Re-Normalization: 

The normalization consists of a loop scanning the input for the minimum and a second loop that
subtracts the minimum from all elements. There is a data dependency between all iterations of the first
loop and all iterations of the second loop. Therefore the two loops cannot be merged or pipelined.
They will be handled individually.

Minimum Search

The third loop is a minimum search on a byte array.

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 64/4);
for(i=0;i<64;i++)
  minmetric = min(minmetric, *iram0++);
Parameter                  Value
Vector length              64
Reused data set size       -
I/O IRAMs                  1I+1O
ALU                        1
BREG                       0
FREG                       0
Data flow graph width      1
Data flow graph height     1
Configuration cycles       64

Reduction recognition eliminates the dependence for minmetric, enabling a four-times unroll to utilize
the IRAM width of 32 bits. A split network has to be added to separate the 8 bit streams using 3
SHIFT and 3 AND operations. Tree balancing redistributes the min() operations to minimize the tree
height.
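Because min() is associative and commutative, the running minimum can be regrouped freely. The following C sketch contrasts the sequential chain with a four-wide, tree-balanced form; the helper names are hypothetical and the initial value 255 assumes unsigned 8-bit metrics.

```c
#include <assert.h>
#include <stddef.h>

static unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* Sequential chain: each min() depends on the previous one. */
static unsigned char min_naive(const unsigned char *v, size_t n)
{
    unsigned char m = 255;
    for (size_t i = 0; i < n; i++)
        m = min_u8(m, v[i]);
    return m;
}

/* Four values per iteration, combined in a balanced tree of height 2
 * before the single loop-carried min(). */
static unsigned char min_tree4(const unsigned char *v, size_t n)
{
    unsigned char m = 255;
    assert(n % 4 == 0);                 /* sketch handles full groups only */
    for (size_t i = 0; i < n; i += 4) {
        unsigned char a = min_u8(v[i], v[i + 1]);
        unsigned char b = min_u8(v[i + 2], v[i + 3]);
        m = min_u8(m, min_u8(a, b));
    }
    return m;
}
```

The two inner min() calls are independent of each other and of m, which is what lets them run in parallel ALUs.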

char *iram0 = vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 16);
for(i=0;i<16;i++)
  minmetric = min(minmetric, min( min(*iram0++, *iram0++),
                                  min(*iram0++, *iram0++) ));



Parameter                  Value
Vector length              16
Reused data set size       -
I/O IRAMs                  1I+1O
ALU                        4*min
BREG                       3*shln+3*shrn
FREG                       0
Data flow graph width      4
Data flow graph height     5
Configuration cycles       5+16

Reduction recognition again eliminates the loop carried dependence for minmetric, enabling loop
tiling and then unroll and jam to increase parallelism; the maximum for the tiling size is 16 IRAMs / 2
IRAMs = 8. Constant propagation and tree rebalancing reduce the dependence height of the final
merging expression:



char *iram0= vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 2);
char *iram1= vp->new_metrics+8;    // XPPPreload(1, vp->new_metrics+8, 2);
char *iram2= vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 2);
char *iram3= vp->new_metrics+24;   // XPPPreload(3, vp->new_metrics+24, 2);
char *iram4= vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 2);
char *iram5= vp->new_metrics+40;   // XPPPreload(5, vp->new_metrics+40, 2);
char *iram6= vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 2);
char *iram7= vp->new_metrics+56;   // XPPPreload(7, vp->new_metrics+56, 2);
for(i=0;i<2;i++) {
  minmetric0 = min(minmetric0, min( min(*iram0++, *iram0++),
                                    min(*iram0++, *iram0++) ));
  minmetric1 = min(minmetric1, min( min(*iram1++, *iram1++),
                                    min(*iram1++, *iram1++) ));
  minmetric2 = min(minmetric2, min( min(*iram2++, *iram2++),
                                    min(*iram2++, *iram2++) ));
  minmetric3 = min(minmetric3, min( min(*iram3++, *iram3++),
                                    min(*iram3++, *iram3++) ));
  minmetric4 = min(minmetric4, min( min(*iram4++, *iram4++),
                                    min(*iram4++, *iram4++) ));
  minmetric5 = min(minmetric5, min( min(*iram5++, *iram5++),
                                    min(*iram5++, *iram5++) ));
  minmetric6 = min(minmetric6, min( min(*iram6++, *iram6++),
                                    min(*iram6++, *iram6++) ));
  minmetric7 = min(minmetric7, min( min(*iram7++, *iram7++),
                                    min(*iram7++, *iram7++) ));
}
minmetric = min( min( min(minmetric0, minmetric1),
                      min(minmetric2, minmetric3) ),
                 min( min(minmetric4, minmetric5),
                      min(minmetric6, minmetric7) ));



Parameter                  Value
Vector length              2
Reused data set size       -
I/O IRAMs                  8I+1O
ALU                        8*4*min = 32
BREG                       8*(3*shln+3*shrn) = 48
FREG                       0
Data flow graph width      8*4 = 32
Data flow graph height     5
Configuration cycles       8+2

Re-Normalization 

The fourth loop subtracts the minimum of the third loop from each element in the array. The
read-modify-write operation has to be broken up into two IRAMs. Otherwise the IRAM ports will limit
throughput.



char *iram0= vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 64/4)
char *iram1= vp->new_metrics;   // XPPPreloadClean(1, vp->new_metrics, 64/4)
for(i=0;i<64;i++)
  *iram1++ = *iram0++ - minmetric;
Parameter                  Value
Vector length              64
Reused data set size       -
I/O IRAMs                  2I+1O
ALU                        1
BREG                       [illegible]
FREG                       0
Data flow graph width      1
Data flow graph height     1
Configuration cycles       64

char *iram0= vp->new_metrics;   // XPPPreload(0, vp->new_metrics, 16)
char *iram1= vp->new_metrics;   // XPPPreloadClean(1, vp->new_metrics, 16)
for(i=0;i<16;i++) {
  *iram1++ = *iram0++ - minmetric;
  *iram1++ = *iram0++ - minmetric;
  *iram1++ = *iram0++ - minmetric;
  *iram1++ = *iram0++ - minmetric;
}


Parameter                  Value
Vector length              16
Reused data set size       -
I/O IRAMs                  2I+1O
ALU                        4*4(sub) = 16
BREG                       6*shln+6*shrn = 12
FREG                       -
Data flow graph width      4
Data flow graph height     5
Configuration cycles       2(split)+4*1(sub)+2(join) = 8

Unroll and jam can be applied after loop tiling, in analogy to the third loop, but loop tiling is now
limited by the BREGs used by the split and join networks. The computed tiling size (unroll factor) is
64 BREGs / 12 BREGs = 5, which is replaced by 4, since the same throughput is achieved with less
overhead.



char *iram0= vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 4)
char *iram1= vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 4)
char *iram2= vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 4)
char *iram3= vp->new_metrics+16;   // XPPPreloadClean(3, vp->new_metrics+16, 4)
char *iram4= vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 4)
char *iram5= vp->new_metrics+32;   // XPPPreloadClean(5, vp->new_metrics+32, 4)
char *iram6= vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 4)
char *iram7= vp->new_metrics+48;   // XPPPreloadClean(7, vp->new_metrics+48, 4)
for(i=0;i<4;i++) {
  *iram1++ = *iram0++ - minmetric;   // first pipeline
  *iram1++ = *iram0++ - minmetric;
  *iram1++ = *iram0++ - minmetric;
  *iram1++ = *iram0++ - minmetric;
  *iram3++ = *iram2++ - minmetric;   // second pipeline
  *iram3++ = *iram2++ - minmetric;
  *iram3++ = *iram2++ - minmetric;
  *iram3++ = *iram2++ - minmetric;
  *iram5++ = *iram4++ - minmetric;   // third pipeline
  *iram5++ = *iram4++ - minmetric;
  *iram5++ = *iram4++ - minmetric;
  *iram5++ = *iram4++ - minmetric;
  *iram7++ = *iram6++ - minmetric;   // fourth pipeline
  *iram7++ = *iram6++ - minmetric;
  *iram7++ = *iram6++ - minmetric;
  *iram7++ = *iram6++ - minmetric;
}



Parameter                  Value
Vector length              4
Reused data set size       -
I/O IRAMs                  5I+4O
ALU                        4*(6(split)+4(sub)+6(join)) = 64
BREG                       4*(6*shln+6*shrn) = 48
FREG                       0
Data flow graph width      16
Data flow graph height     1
Configuration cycles       2(split)+4*1(sub)+2(join) = 8

5.4.6 Final Code

Finally we arrive at the following code: 

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2){
  int i;
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  // initialization loop eliminated
  // for(i=0;i<8;i++)
  //   vp->dp->w[i] = 0;

  // Configuration for butterfly loop
  char *iram0= Branchtab29_1;          // XPPPreload(0, Branchtab29_1, 128/4);
  char *iram2= Branchtab29_2;          // XPPPreload(2, Branchtab29_2, 128/4);
  char *iram4= vp->old_metrics;        // XPPPreload(4, vp->old_metrics, 128/4);
  char *iram5= vp->old_metrics+128;    // XPPPreload(5, vp->old_metrics+128, 128/4);
  short *iram6= vp->new_metrics;       // XPPPreload(6, vp->new_metrics, 128/2);
  unsigned long *iram7= vp->dp->w;     // XPPPreload(7, vp->dp->w, 8);
  // sym1 & sym2 are in IRAM 1 & 3

  for(_i=0;_i<8;_i++) {
    rlse= 0;
    for(i2=0;i2<32;i2+=2) {   // unrolled once
      unsigned char metric,_tmp,m0,m1,_m0,_m1;

      metric = ((*iram0++ ^ sym1) +
                (*iram1++ ^ sym2) + 1)/2;
      _tmp= (metric << 1) - 15;
      m0 = *iram2++ + metric;
      m1 = *iram3++ + (15 - metric);
      _m0 = m0 - _tmp;
      _m1 = m1 + _tmp;

      *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
      rlse = rlse | (m0 >= m1) << i2
                  | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rlse;
  }

  /* Renormalize metrics */
  if(vp->new_metrics[0] > 150){
    int i;

    // Configuration for loop 3
    char *iram0= vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 8);
    char *iram1= vp->new_metrics+8;    // XPPPreload(1, vp->new_metrics+8, 8);
    char *iram2= vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 8);
    char *iram3= vp->new_metrics+24;   // XPPPreload(3, vp->new_metrics+24, 8);
    char *iram4= vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 8);
    char *iram5= vp->new_metrics+40;   // XPPPreload(5, vp->new_metrics+40, 8);
    char *iram6= vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 8);
    char *iram7= vp->new_metrics+56;   // XPPPreload(7, vp->new_metrics+56, 8);
    for(i=0;i<2;i++) {
      minmetric0 = min(minmetric0, min( min(*iram0++, *iram0++),
                                        min(*iram0++, *iram0++) ));
      minmetric1 = min(minmetric1, min( min(*iram1++, *iram1++),
                                        min(*iram1++, *iram1++) ));
      minmetric2 = min(minmetric2, min( min(*iram2++, *iram2++),
                                        min(*iram2++, *iram2++) ));
      minmetric3 = min(minmetric3, min( min(*iram3++, *iram3++),
                                        min(*iram3++, *iram3++) ));
      minmetric4 = min(minmetric4, min( min(*iram4++, *iram4++),
                                        min(*iram4++, *iram4++) ));
      minmetric5 = min(minmetric5, min( min(*iram5++, *iram5++),
                                        min(*iram5++, *iram5++) ));
      minmetric6 = min(minmetric6, min( min(*iram6++, *iram6++),
                                        min(*iram6++, *iram6++) ));
      minmetric7 = min(minmetric7, min( min(*iram7++, *iram7++),
                                        min(*iram7++, *iram7++) ));
    }
    minmetric = min( min( min(minmetric0, minmetric1),
                          min(minmetric2, minmetric3) ),
                     min( min(minmetric4, minmetric5),
                          min(minmetric6, minmetric7) ));
    // minmetric is written to the output IRAM

    // Configuration for loop 4, minmetric is in an input IRAM
    char *iram0= vp->new_metrics;      // XPPPreload(0, vp->new_metrics, 4)
    char *iram1= vp->new_metrics;      // XPPPreloadClean(1, vp->new_metrics, 4)
    char *iram2= vp->new_metrics+16;   // XPPPreload(2, vp->new_metrics+16, 4)
    char *iram3= vp->new_metrics+16;   // XPPPreloadClean(3, vp->new_metrics+16, 4)
    char *iram4= vp->new_metrics+32;   // XPPPreload(4, vp->new_metrics+32, 4)
    char *iram5= vp->new_metrics+32;   // XPPPreloadClean(5, vp->new_metrics+32, 4)
    char *iram6= vp->new_metrics+48;   // XPPPreload(6, vp->new_metrics+48, 4)
    char *iram7= vp->new_metrics+48;   // XPPPreloadClean(7, vp->new_metrics+48, 4)
    for(i=0;i<4;i++) {
      *iram1++ = *iram0++ - minmetric;   // first pipeline
      *iram1++ = *iram0++ - minmetric;
      *iram1++ = *iram0++ - minmetric;
      *iram1++ = *iram0++ - minmetric;
      *iram3++ = *iram2++ - minmetric;   // second pipeline
      *iram3++ = *iram2++ - minmetric;
      *iram3++ = *iram2++ - minmetric;
      *iram3++ = *iram2++ - minmetric;
      *iram5++ = *iram4++ - minmetric;   // third pipeline
      *iram5++ = *iram4++ - minmetric;
      *iram5++ = *iram4++ - minmetric;
      *iram5++ = *iram4++ - minmetric;
      *iram7++ = *iram6++ - minmetric;   // fourth pipeline
      *iram7++ = *iram6++ - minmetric;
      *iram7++ = *iram6++ - minmetric;
      *iram7++ = *iram6++ - minmetric;
    }
    normalize = minmetric;
  }

  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}



Performance Considerations 

In this example we do not have a high data locality. Every input data item is read exactly once. Only in 
the case of re-normalization, the new_metric array is re-read and re-written. To fully utilize the PAE
array, loop tiling was used - in conjunction with reduction recognition to break dependencies using 
algebraic identities. In some cases (minimum search) this leads to extremely short vector lengths. This 
does not hurt as it still does reduce the running time of the configuration and the transfer time from the 
top of the memory hierarchy to the IRAMs stays the same. The vector length could be increased if the 
outer loop that calls the function was known - the additional data could be used to increase the fill 
grade of the IRAMs by unrolling the outer loop, keeping the vector length longer. This would further 
increase configuration performance by reducing overall pipeline setup times. 

Performance of XPP for this example is compared to a hypothetical superscalar RISC-architecture. We
assume an average issue width of two, which means that the RISC on average executes two operations
in parallel. The estimate is achieved by counting instructions for the source code in 5.4.2.
5.5 MPEG2 encoder/decoder

5.5.1 Quantization / Inverse Quantization (quant.c)

The quantization file contains routines for quantization and inverse quantization of 8x8 macro blocks.
These functions differ for intra and non-intra blocks and furthermore the encoder distinguishes
between MPEG1 and MPEG2 inverse quantization.

This gives a total of 6 functions, which are all candidates for function inlining, since they do not use
the XPP capacity by far.

Since all functions have the same layout (some checks, one main loop running over the macro block
quantizing with a quantization matrix), we concentrate on "iquant_intra", the inverse quantization of
intra-blocks, since it contains all elements found in the other procedures. (The non-intra quantization
loop bodies are more complicated, but add no compiler complexity.) In the source code the mpeg part
is already inlined, which is straightforward since the function is statically defined and contains no
function calls itself. Therefore the compiler inlines it and dead function elimination removes the whole
definition.

Original Code 

void iquant_intra(src,dst,dc_prec,quant_mat,mquant)
short *src, *dst;
int dc_prec;
unsigned char *quant_mat;
int mquant;
{
  int i, val, sum;

  if (mpeg1) {
    dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;

      /* mismatch control */
      if ((val&1)==0 && val!=0)
        val+= (val>0) ? -1 : 1;

      /* saturation */
      dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }
  }
  else
  {
    sum = dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;
      sum+= dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }

    /* mismatch control */
    if ((sum&1)==0)
      dst[63]^= 1;
  }
}

Interprocedural Optimizations 

Analysing the loop bodies shows that they easily fit to the XPP and do not use the maximum of
resources by far. The function is called three times from module putseq.c. With inter-module function
inlining the code for the function call disappears and is replaced with the function body. Therefore it reads:

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA) {
    for (j=0; j<block_count; j++)
      if (mpeg1) {
        blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                     (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      } else {
        sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                           (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      }
  } else {
    ...
  }
}

Basic transformations 

Since the global mpeg1 does not change within the loop, unswitching moves the control statement outside
the j loop and produces two loop nests.

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      for (j=0; j<block_count; j++) {
        blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                     (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      }
    else
      for (j=0; j<block_count; j++) {
        sum = blocks[k*block_count+j][0] = blocks[k*block_count+j][0] <<
                                           (3-dc_prec);
        for (i=1; i<64; i++) {
          val = (int)(blocks[k*block_count+j][i]*intra_q[i]*mquant)/16;
          ...
        }
      }
}
Furthermore the following transformations are done:

* A peephole optimization reduces the divide by 16 to a right shift by 4. This is essential since we do
not consider loop bodies containing division for the XPP.

* Idiom recognition reduces the statement after the "saturation" comment to
dst[i] = min(max(val, -2048), 2047)
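Both rewrites can be checked with a small C sketch. The function names are hypothetical. One caveat worth stating as an assumption of the sketch: in C, integer division truncates toward zero, whereas a right shift of a negative value is implementation-defined and usually rounds toward negative infinity, so the shift and the division are only compared here for non-negative values.

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Saturation as written in the original source. */
static int saturate_ternary(int val)
{
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

/* Saturation after idiom recognition. */
static int saturate_minmax(int val)
{
    return imin(imax(val, -2048), 2047);
}

/* Divide by 16 replaced by a shift; identical for val >= 0. */
static int div16_shift(int val)
{
    assert(val >= 0);           /* sketch restricts itself to the safe range */
    return val >> 4;
}
```

For negative products a compiler performing this peephole step has to account for the rounding difference, for example by adjusting the value before shifting.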

Increasing parallelism 

Now we want to increase parallelism. The j-i loop nest is a candidate for unroll-and-jam when the
interprocedural value range analysis finds out that block_count can only take the values 6, 8 or 12.
Therefore it has a value range [6,12] with the additional attribute of being divisible by 2. Thus an unroll
and jam with the factor 2 is applicable (the resource constraints would choose a bigger value). Since
no loop carried dependencies exist, this transformation is safe.

Note that the source code contains a manually peeled first iteration. This peeling has been done
because the value calculated for the first block value is completely different from the other iterations
and the control statement in the loop would cause a major performance decrease on traditional
processors. Although this does not prevent unroll-and-jam (because there are no dependencies between
the peeled-off first iteration and the rest of the loop), the transformation must be prepared to handle
such cases.

After unroll and jam the source code looks as follows (only one of the nests is shown, and the peeled first
iterations are moved to the front):

for (j=0; j<block_count; j+=2) {
  blocks[k*count+j][0] = blocks[k*count+j][0] << (3-dc_prec);
  blocks[k*count+j+1][0] = blocks[k*count+j+1][0] << (3-dc_prec);
  for (i=1; i<64; i++) {
    val = (int)(blocks[k*count+j][i]*intra_q[i]*mbinfo[k].mquant)>>4;

    /* mismatch control */
    if ((val&1)==0 && val!=0)
      val+= (val>0) ? -1 : 1;

    /* saturation */
    blocks[k*count+j][i] = min(max(val, -2048), 2047);

    val = (int)(blocks[k*count+j+1][i]*intra_q[i]*mbinfo[k].mquant)>>4;

    /* mismatch control */
    if ((val&1)==0 && val!=0)
      val+= (val>0) ? -1 : 1;

    /* saturation */
    blocks[k*count+j+1][i] = min(max(val, -2048), 2047);
  }
}

Further parallelism can be obtained by index set splitting. Normally used to break dependence cycles
in the DDG, it can here be used to split the i-loop in two and let two sub-configurations [5] work on
distinct blocks of data. Thus the i loop is split into 2 or more loops which work on different subsets of
the data at the same time.
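Index set splitting of the i loop can be sketched as follows. The loop body is reduced to a simplified scaling step and the function names are hypothetical; the point is only that the two halves of the index range touch disjoint elements and could therefore run as independent networks.

```c
#include <assert.h>

/* Single loop over the whole index set 1..63. */
static void dequant_whole(int v[64], int q)
{
    for (int i = 1; i < 64; i++)
        v[i] = (v[i] * q) / 16;
}

/* Index set split at 32: the two loops share no data, so on the XPP
 * they could be mapped to two sub-configurations working concurrently. */
static void dequant_split(int v[64], int q)
{
    for (int i = 1; i < 32; i++)      /* first half of the index set  */
        v[i] = (v[i] * q) / 16;
    for (int i = 32; i < 64; i++)     /* second half of the index set */
        v[i] = (v[i] * q) / 16;
}
```

In sequential C both versions compute the same result; the benefit only appears once the two loops are executed in parallel.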



[5] Sub-configuration is chosen as a working title for configurations which contain independent networks that do
not interfere.
Handling the data types

In contrast to the FIR-Filter, edge detector and matrix multiplication benchmarks, which all use data
types fitting perfectly to the XPP [6], the MPEG2 codec uses all data types commonly used on a
processor for desktop applications. Written for the Intel x86 and comparable architectures, we must
assume that the sizes of char, short and int are 8, 16, and 32 bits respectively. Assuming that the XPP has a
bit width of 32 we must take precautions for the smaller data types.

Therefore we split the stream of data packets with each packet containing 2 or 4 values of the shorter
data type into 2 or 4 streams. If we have enough resources left, this will cause no performance penalty.
Each of the divided streams is sent to its own calculation network; therefore in every cycle two short
or four char values are handled. Nevertheless this causes an area penalty, because besides the
split-merge elements, the whole data flow graph has to be duplicated as often as needed. Figure 63 shows
how short values are handled. The packet is split into its hi and lo part by shift operations and merged
behind the calculation branches. The legality of this transformation is the same as with loop unrolling
with an unrolling factor equal to the ratio of the architecture data type size to the shorter data type size.

Unfortunately this is not the end of the story. The compiler further has to assure that every intermediate
result which produces an over/under-flow for the shorter data type does the same with the bigger data
type. Therefore it has to insert clipping operations which assure that the network calculates with real
16 or 8 bit values, respectively.
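The split, compute, clip and merge steps for short values can be illustrated in C. This is a sketch under assumptions: the helper names are hypothetical, the operation applied to each half (adding a constant) is a stand-in for the real calculation network, and "clipping" is modelled as truncation to 16 bits so that overflow wraps exactly as it would on a 16-bit machine type.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 16 bits of h as a signed 16-bit value. */
static int32_t sext16(uint32_t h)
{
    int32_t v = (int32_t)(h & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Merge two values into one 32-bit packet, keeping 16 bits of each. */
static uint32_t pack16(int32_t hi, int32_t lo)
{
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Split the packet, run both halves through the same operation,
 * and merge the truncated results again. */
static uint32_t add_to_both(uint32_t packet, int32_t k)
{
    int32_t hi = sext16(packet >> 16) + k;   /* hi stream */
    int32_t lo = sext16(packet) + k;         /* lo stream */
    return pack16(hi, lo);                   /* masking emulates 16-bit overflow */
}
```

Without the masking in pack16 a carry out of the lo half would leak into the hi half, which is the kind of mismatch the inserted clipping operations prevent.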



Figure 63: Splitting short values into two streams and merging them
after the calculation. This method causes no performance penalty.



[6] We assume that the size of int is chosen to be the XPP architecture data bit width. Everything else would not
lead to any feasible result.
If the configuration size does not allow the whole loop body to be duplicated or dependencies prevent
this, we still have the possibility to merge the split values again. This of course causes a performance
penalty compared to the previous solution, because the throughput is only one (short) value/cycle now. Figure 64
shows how the merge is done. Instead of streaming in parallel through two networks the values are
serialized and de-serialized again after the network.








Figure 64: Merging the split values before the network. An event
generator drives the merge and demux PAEs. This figure replaces the 2
black boxes labeled "network" in Figure 63.



5.5.2 Inverse Discrete Cosine Transformation (idct.c)

The idct algorithm is used for the MPEG2 video decompression algorithm. It operates on 8x8 blocks
of video images in their frequency representation and transforms them back into their original signal
form. The MPEG2 decoder contains a transform function that calls idct for all blocks of a frequency
transformed picture to restore the original image.

The idct function consists of two for-loops. The first loop calls idctrow, the second idctcol. Function
inlining is able to eliminate the function calls within the entire loop nest structure so that the numeric
code is not interrupted by function calls anymore. Another way to get rid of function calls between the
loop nest is loop embedding that pushes loops from the caller into the callee.

Original Code (idct.c)

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
  int i;

  for (i=0; i<8; i++)
    idctrow(block+8*i);

  for (i=0; i<8; i++)
    idctcol(block+i);
}
The first loop changes the values of the block row by row. Afterwards the changed block is further
transformed column by column. All rows have to be finished before any column processing can be
started.



[Figure: block -> 8 x idctrow -> 8 x idctcol -> result]




Dependency analysis detects true data dependencies between row processing and column processing.
Therefore the processing of the columns has to be delayed until all rows are done. The innermost loop
bodies idctrow and idctcol are nearly identical. They perform numeric calculations on eight input
values (column values in the case of idctcol and row values in the case of idctrow). Eight output values are
calculated and written back (as column/row). idctcol additionally applies clipping before the values are
written back. This is why we concentrate on idctcol:



/* column (vertical) IDCT
 *
 *             7                         pi         1
 * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
 *            l=0                        8          2
 *
 * where: c[0]    = 1/1024
 *        c[1..7] = (1/1024)*sqrt(2)
 */
static void idctcol(blk)
short *blk;
{
  int x0, x1, x2, x3, x4, x5, x6, x7, x8;

  /* shortcut */
  if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) |
        (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) |
        (x6 = blk[8*5]) | (x7 = blk[8*3])))
  {
    blk[8*0]=blk[8*1]=blk[8*2]=blk[8*3]=blk[8*4]=blk[8*5]=
      blk[8*6]=blk[8*7]=iclp[(blk[8*0]+32)>>6];
    return;
  }

  x0 = (blk[8*0]<<8) + 8192;

  /* first stage */
  x8 = W7*(x4+x5) + 4;
  x4 = (x8+(W1-W7)*x4)>>3;
  x5 = (x8-(W1+W7)*x5)>>3;
  x8 = W3*(x6+x7) + 4;
  x6 = (x8-(W3-W5)*x6)>>3;
  x7 = (x8-(W3+W5)*x7)>>3;

  /* second stage */
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2) + 4;
  x2 = (x1-(W2+W6)*x2)>>3;
  x3 = (x1+(W2-W6)*x3)>>3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;

  /* third stage */
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;

  /* fourth stage */
  blk[8*0] = iclp[(x7+x1)>>14];
  blk[8*1] = iclp[(x3+x2)>>14];
  blk[8*2] = iclp[(x0+x4)>>14];
  blk[8*3] = iclp[(x8+x6)>>14];
  blk[8*4] = iclp[(x8-x6)>>14];
  blk[8*5] = iclp[(x0-x4)>>14];
  blk[8*6] = iclp[(x3-x2)>>14];
  blk[8*7] = iclp[(x7-x1)>>14];
}
W1 to W7 are macros for numeric constants that are substituted by the preprocessor. The iclp array is
used for clipping the results to 8-bit values. It is fully defined by the init_idct function before idct is
called the first time:

void init_idct()
{
  int i;

  iclp = iclip+512;
  for (i= -512; i<512; i++)
    iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
}



A special kind of idiom recognition (function recognition) is able to replace the
calculation of each array element by a compiler known function that can be
realized efficiently on the XPP. If the compiler features whole program memory
aliasing analysis it is able to replace all uses of the iclp array with the call of the
compiler known function. Alternatively a developer can replace the iclp array
accesses manually by the compiler known saturation function calls. The
illustration shows a possible implementation for saturate(val,n) as NML
schematic using two ALUs. In this case it is necessary to replace array accesses
like iclp[i] by saturate(i,256).

[Figure: NML schematic of saturate(val,n)]
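The equivalence between the lookup table and a saturation function can be checked in C. The exact semantics of the compiler known saturate(val,n) are not spelled out in the text; the sketch below assumes that it clamps val to the range [-n, n-1], which is the only reading under which saturate(i,256) reproduces iclp[i].

```c
#include <assert.h>

static short iclip[1024];   /* clipping table, as in the decoder source */
static short *iclp;

/* Table initialization as shown above. */
static void init_idct(void)
{
    int i;

    iclp = iclip + 512;
    for (i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Assumed semantics: clamp val to [-n, n-1] (one max, one min). */
static int saturate(int val, int n)
{
    if (val < -n)
        return -n;
    if (val > n - 1)
        return n - 1;
    return val;
}
```

Replacing the table lookup removes one IRAM access per output value at the cost of two ALUs.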



The /* shortcut */ code in idctcol speeds up column processing if x1 to x7 are zero. This breaks the
well-formed structure of the loop nest. The if-condition is not loop invariant and loop unswitching
cannot be applied. Nonetheless, the code after the shortcut handling is well suited for the XPP. It is
possible to synthesize if-conditions for the XPP (speculative processing of both blocks plus selection
based on condition) but this would just waste PAEs without any performance benefit. Therefore the
/* shortcut */ code in idctrow and idctcol has to be removed manually. The code snippet below
shows the inlined version of the idctrow loop with additional cache instructions for XPP control:

void idct(block)
short *block;
{
  int i;

  XPPPreload(IDCTROW_CONFIG);   // Loop Invariant

  for (i=0; i<8; i++) {
    short *blk;
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;

    blk = block+8*i;
    XPPPreload(0, blk, 8);
    XPPPreloadClean(1, blk, 8);   // IRAM1 is erased and assigned to blk
    XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));
  }

  for (i=0; i<8; i++) {
    ...
  }
}

As the configuration of the XPP does not change during the loop execution, invariant code motion has
moved XPPPreload(IDCTROW_CONFIG) out of the loop.
NML Code Generation

Data Flow Graph

As idctcol is more complex due to clipping at the end of the calculations, we decided to take idctcol as
the representative loop body for a presentation of the data flow graph.

The figure on the next page shows the data flow graph for the IDCTCOLUMN_CONFIG. A heuristic
has to be applied to the graph to estimate the resource needs on the XPP. In our example the heuristic
produces the following results:
Fortunately the data flow graph fits into an XPP64 and we can proceed without loop dissevering [7]
(splitting the loop body into suitable chunks) for this example.



[7] XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture, J.M.P. Cardoso and
Markus Weinhardt
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Address Generation: 



To fully synthesize the loop body we have to face the problem of address generation for accessing the
data.

For IDCTCOLUMN_CONFIG we have to select the n-th element
of every row, which means an address series of (0,8,16,... 1,9,17,...
7,15,23,...). We use two counter macros for address generation as
shown opposite. The upper counter increments by eight and the
lower by one. The IRAM output is passed to the data flow graph
of IDCTCOLUMN. If all (eight) row elements of a column are
available, SWAP is switched through to the data flow graph input
and the calculation for a new column begins.
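
The address series produced by the two counters can be modelled in plain C. This is only an illustrative sketch of the sequence described above, not NML code; the function name column_addresses and the output array are assumptions made for the example.

```c
/* Hypothetical model of the two address counters for IDCTCOLUMN_CONFIG:
 * the upper counter steps by eight (one row further down), the lower
 * counter steps by one (next column). */
void column_addresses(int addr[64])
{
    int n = 0;
    for (int lower = 0; lower < 8; lower++)          /* column index */
        for (int upper = 0; upper < 64; upper += 8)  /* row offset   */
            addr[n++] = upper + lower;
}
```

Filling an array with column_addresses yields 0, 8, 16, ... for the first column, followed by 1, 9, 17, ... for the second, which matches the series given in the text.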

For the IDCTROW_CONFIG the address generation is very
simple as the IRAM already contains the block in the appropriate
order (row after row, as it has to be accessed). Again, by using
SIUP (stepped iterative up) counter macros as described in the
XPP tutorial it is possible to map linear address expressions to
NML code in a generic way. As IDCTROW_CONFIG accesses a
two-dimensional array we need two SIUP counters in the
corresponding NML code. The column elements have to be
accessed row after row, so the upper counter's increment is one and
the lower counter's increment is eight. However, the NML code
for this access pattern (0,...,5,6,7,8,9,...,63) can be reduced to one
single counter (or to FIFO-mode IRAM access).

Address generation for write access is implemented in the same manner. The resources have to be
updated to take this additional code into account. It takes 2*(8+8+2*1) FREGs and 2*(2+1) more
BREGs in the worst case, which is still available on the XPP.

If IRAM use is not critical it is also possible to distribute the data over several IRAMs to improve the
memory throughput into the XPP array. This task has to be done by the RISC core or by a more
sophisticated XPP cache controller.




Further Enhancing XPP Utilization 

As mentioned at the beginning, idct is called for all data blocks of a video image (loop in transform.c).
This circumstance allows us to further improve the XPP utilization.

When we look at the data flow graph of idctcol in detail we see that it forms a very deep pipeline. If
we consider that the IDCTROW_CONFIG runs only eight times on the XPP, which
means that only 64 (8 times 8 elements of a column) elements are processed through this pipeline, and
that we then have to change the configuration to go on with column processing, then it becomes obvious that
something is suboptimal in our example.

Problem (Pipeline Depth) 

The pipeline is just too deep for processing only eight times eight rows. Filling
and flushing a deep pipeline is expensive if only little data is processed with it.
First the units at the end of the pipeline are idle and then the units at the beginning
are unused.

[Figure: pipeline depth, showing the stages holding DATA and the IDLE stages]
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Solution (Loop Tiling) 

It is profitable to use loop interchange for moving the dependencies between row and column
processing to an outer level of the loop nest. The loop that calls the idct function (in transform.c) on
several blocks of the image has no dependencies that prevent loop interchange. Therefore this loop
can be moved inside the loops of column and row processing.



// transform.c

for (n=0; n<block_count; n++) {
  idct(blocks[k*block_count+n]); // block_count is 6 or 8 or 12
}

// idct.c

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
  int i;

  for (i=0; i<8; i++)
    idctrow(block+8*i);

  for (i=0; i<8; i++)
    idctcol(block+i);
}



Now the processing of rows and columns can be applied to more data (by applying loop tiling), and
therefore the cost of filling and flushing the pipeline can be neglected.
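
The effect of the interchange can be sketched in C as follows. The sketch assumes that the row and column passes of different blocks are independent, as stated above. The functions rowpass and colpass are simple placeholders and not the real idctrow and idctcol kernels, and NBLK stands for block_count (6 in this example).

```c
#define NBLK 6   /* block_count; 6 is one of the values named in the text */

/* Placeholder passes (NOT the real IDCT kernels): each one only touches
 * the row or the column it is given, like idctrow and idctcol do. */
void rowpass(short *row) { for (int j = 1; j < 8; j++) row[j] += row[j-1]; }
void colpass(short *col) { for (int j = 1; j < 8; j++) col[8*j] += col[8*(j-1)]; }

/* Original order: rows, then columns, one block at a time. */
void idct_per_block(short blocks[NBLK][64])
{
    for (int n = 0; n < NBLK; n++) {
        for (int i = 0; i < 8; i++) rowpass(blocks[n] + 8*i);
        for (int i = 0; i < 8; i++) colpass(blocks[n] + i);
    }
}

/* Interchanged order: all rows of the whole tile first, then all columns.
 * Each pass now streams NBLK*64 values through one configuration. */
void idct_tiled(short blocks[NBLK][64])
{
    for (int n = 0; n < NBLK; n++)
        for (int i = 0; i < 8; i++) rowpass(blocks[n] + 8*i);
    for (int n = 0; n < NBLK; n++)
        for (int i = 0; i < 8; i++) colpass(blocks[n] + i);
}
```

Both orders produce the same result because a column pass of one block never reads data of another block. Only the number of values processed per configuration changes.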

Constraints (Cache Sensitive Loop Tiling) 

The caching hierarchy has to be taken into account when we define the number of blocks that will be
processed by the IDCTROW_CONFIG. Remember, we need the same blocks in the subsequent
IDCTCOLUMN_CONFIG configuration! We have to take care that all blocks that are processed
during IDCTROW_CONFIG fit into the cache. Loop tiling has to be applied with respect to the cache
size so that the processed data fits into the cache.

IRAM reuse between different configurations

This example implies another bandwidth optimization that is just a
more consistent version of loop tiling. Instead of transferring data
from row processing to column processing via the memory hierarchy
(cache sensitive loop tiling takes care that only the cache memory is
accessed) we can completely bypass the memory interface by using
the output IRAM of Config A as input IRAM of Config B.

Putting all together 



[Figure: Input IRAM, Config A, Shared IRAM, Config B, Output IRAM]
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If we apply cache sensitive loop tiling, IRAM reuse and function inlining we can further optimize our 
example: 

Finally the idct function gets completely inlined in transform.c. If block_count is e.g. 6 and we assume
that 64*6 words do not exceed the cache size, then we can transform the example to:

// transform.c

block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreload(0, block, 64*6);      // IRAM0 gets 64 words from 6 blocks
XPPPreloadClean(1, block, 64*6); // erase IRAM1 and assign to the 6 blocks
XPPExecute(IDCTROW_CONFIG, IRAM(0), IRAM(1));

XPPPreload(IDCOLUMN_CONFIG);
XPPPreload(1, block, 64*6);      // redundant -> will be eliminated
XPPExecute(IDCOLUMN_CONFIG, IRAM(1), IRAM(2));



The address generation in IDCTROW_CONFIG and IDCOLUMN_CONFIG has to be modified to
reflect the different data block size, caused by loop tiling, that has to be processed. This can be
implemented by an additional SIUP counter that generates the block offsets inside the tiles.

[Figure: additional counter generating the block offset, block_count = 6]



The table contains architectural parameters for IDCTROW_CONFIG and IDCOLUMN_CONFIG of
the final result. It relies on a cache that is able to store block_count blocks. As two configurations are
executed in this example, the configuration cycles have to be counted twice, and therefore the total
configuration cycles are 2 x (block_count x 64 + (12 + 2 x 8) x 2).



Parameter                   Value
Vector length               8 words
Reused data set size        block_count x 64 words
I/O IRAMs                   3 (one shared)
ALU                         45 FUs
BREG                        41 FUs
FREG                        36 FUs
Data flow graph width       8
Data flow graph height      12
Configuration cycles        block_count x 64 + (12 + 2 x 8) x 2
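
As a worked example, the configuration cycle formula can be evaluated for the block counts named earlier (6, 8 or 12). The following C sketch only restates the arithmetic of the table; the function names are chosen for illustration.

```c
/* Configuration cycles of one configuration, following the table:
 * block_count x 64 + (12 + 2 x 8) x 2 */
int config_cycles(int block_count)
{
    return block_count * 64 + (12 + 2 * 8) * 2;
}

/* Two configurations are executed per tile (row and column processing),
 * so the cycles are counted twice. */
int total_config_cycles(int block_count)
{
    return 2 * config_cycles(block_count);
}
```

For block_count = 6 this gives 6 x 64 + 56 = 440 cycles per configuration and 880 cycles in total.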
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Performance Considerations 

In this example it is possible to exploit high data locality, which means that many operations are
performed on a limited memory range. The performance of the proposed XPP solution is compared to
a hypothetical superscalar RISC architecture. We assume an issue width of two, which means that the
RISC executes on average two operations in parallel.



Operation    Ops for Row/Column   Est. RISC cycles per op   Est. RISC cycles
LD/ST                16                      2                     32
ADRCOMP              16                      1                     16
ADD/SUB              35                      1                     35
MULT                 11                      2                     22
SHIFT                18                      1                     18
SAT                   8                      4                     32
Total                                                             155

Issue width          2
Cyc/Row(Col)         77.5

             Rows/Cols per block   Est. RISC cycles
Proc. Rows            8                   620
Proc. Cols            8                   620

RISC Cyc/Blk         1240
XPP Cyc/Blk          128   (with data duplication+reordering: 24)
Speedup              10    (with data duplication+reordering: 52)

Even though the speedup is reasonable, it becomes obvious that fetching the input data from a single IRAM
(which means that we have to feed the eight inputs in eight cycles before processing starts) reduces
the potential speedup significantly. In other words, we have a pipeline that is able to process eight
input values per cycle, but we are loading the pipeline only every eighth cycle. As a result only
every eighth pipeline stage is filled. The figure below illustrates this:

[Figure: pipeline utilization without data duplication and with data duplication]

Full utilization can be achieved only by loading the eight input values of the pipeline in one cycle.
This requires the input data to be available in more than one IRAM. We use the XPPPreloadMultiple
command for this purpose:

XPPPreload(0, block, 64*6); // IRAM0 gets 64 words from 6 blocks

is changed to:

XPPPreloadMultiple(0xFF, block, 64*6); // load IRAM0 up to IRAM7 with blocks
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Now the pipeline gets fully utilized and we also have to store eight results per cycle. This can be 
achieved by writing every output value to another IRAM, which additionally takes eight more IRAMs
(using data duplication in this example needs all 16 IRAMs of the XPP64). For storing the data that is
generated with IDCTROW_CONFIG we have to change:

XPPPreloadClean(1, block, 64*6); // erase IRAM1 and assign to the 6 blocks

to:

tmpsize = 64*6/8;
XPPPreloadClean( 8, block+0*tmpsize, tmpsize); // IRAM8 for interm. Rslt
XPPPreloadClean( 9, block+1*tmpsize, tmpsize); // IRAM9 for interm. Rslt
XPPPreloadClean(10, block+2*tmpsize, tmpsize); // IRAM10 for interm. Rslt
XPPPreloadClean(11, block+3*tmpsize, tmpsize); // IRAM11 for interm. Rslt
XPPPreloadClean(12, block+4*tmpsize, tmpsize); // IRAM12 for interm. Rslt
XPPPreloadClean(13, block+5*tmpsize, tmpsize); // IRAM13 for interm. Rslt
XPPPreloadClean(14, block+6*tmpsize, tmpsize); // IRAM14 for interm. Rslt
XPPPreloadClean(15, block+7*tmpsize, tmpsize); // IRAM15 for interm. Rslt

This causes different data layouts for the intermediate results. We need an additional configuration
(REORDER_CONFIG) to restore the original data layout.



[Figure: data layout of the IRAMs for IDCTROW_CONFIG, IDCTCOLUMN_CONFIG and REORDER_CONFIG]



Again address generation has to be modified to fetch eight input values per cycle. This on the one hand 
requires seven additional adders, but on the other hand avoids swaps and latches for keeping the data 
eight cycles. 

Data duplication and data reordering finally transform the example code to:

// transform.c

block = blocks[k*6];
XPPPreload(IDCTROW_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);         // load IRAM0 up to IRAM7 with blocks
tmpsize = 64*6/8;                              // result gets separated into 8 IRAMs
XPPPreloadClean( 8, block+0*tmpsize, tmpsize); // IRAM8 for interm. Rslt 1
XPPPreloadClean( 9, block+1*tmpsize, tmpsize); // IRAM9 for interm. Rslt 1
XPPPreloadClean(10, block+2*tmpsize, tmpsize); // IRAM10 for interm. Rslt 1
XPPPreloadClean(11, block+3*tmpsize, tmpsize); // IRAM11 for interm. Rslt 1
XPPPreloadClean(12, block+4*tmpsize, tmpsize); // IRAM12 for interm. Rslt 1
XPPPreloadClean(13, block+5*tmpsize, tmpsize); // IRAM13 for interm. Rslt 1
XPPPreloadClean(14, block+6*tmpsize, tmpsize); // IRAM14 for interm. Rslt 1
XPPPreloadClean(15, block+7*tmpsize, tmpsize); // IRAM15 for interm. Rslt 1
XPPExecute(IDCTROW_CONFIG, IRAM(0-7), IRAM(8-15));

XPPPreload(IDCOLUMN_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);         // ld. IRAM0-IRAM7 with interm. Rslt 1
XPPPreloadClean( 8, block+0*tmpsize, tmpsize); // IRAM8 for interm. Rslt 2
XPPPreloadClean( 9, block+1*tmpsize, tmpsize); // IRAM9 for interm. Rslt 2
XPPPreloadClean(10, block+2*tmpsize, tmpsize); // IRAM10 for interm. Rslt 2
XPPPreloadClean(11, block+3*tmpsize, tmpsize); // IRAM11 for interm. Rslt 2
XPPPreloadClean(12, block+4*tmpsize, tmpsize); // IRAM12 for interm. Rslt 2
XPPPreloadClean(13, block+5*tmpsize, tmpsize); // IRAM13 for interm. Rslt 2
XPPPreloadClean(14, block+6*tmpsize, tmpsize); // IRAM14 for interm. Rslt 2
XPPPreloadClean(15, block+7*tmpsize, tmpsize); // IRAM15 for interm. Rslt 2
XPPExecute(IDCOLUMN_CONFIG, IRAM(0-7), IRAM(8-15));

XPPPreload(REORDER_CONFIG);
XPPPreloadMultiple(0xFF, block, 64*6);           // ld. IRAM0-IRAM7 with interm. Rslt 2
rsltsize = 64;                                   // 64*6/6
XPPPreloadClean( 8, block+0*rsltsize, rsltsize); // IRAM8 for final Rslt
XPPPreloadClean( 9, block+1*rsltsize, rsltsize); // IRAM9 for final Rslt
XPPPreloadClean(10, block+2*rsltsize, rsltsize); // IRAM10 for final Rslt
XPPPreloadClean(11, block+3*rsltsize, rsltsize); // IRAM11 for final Rslt
XPPPreloadClean(12, block+4*rsltsize, rsltsize); // IRAM12 for final Rslt
XPPPreloadClean(13, block+5*rsltsize, rsltsize); // IRAM13 for final Rslt
XPPExecute(REORDER_CONFIG, IRAM(0-7), IRAM(8-13));
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5.6 Wavelet 



5.6.1 Original Code 

void forward_wavelet()
{
  int i, nt, *dmid;
  int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
  int mid, ii;
  int *x;
  int s[256], d[256];

  for (nt=COL; nt>=BLOCK_SIZE; nt>>=1) {
    for (i=0; i<nt*COL /*tmp_nt*/; i+=COL) {
      x = &int_data[i];
      mid = (nt>>1)-1;
      s[0] = x[0];
      d[0] = x[ROW];
      s[1] = x[2];
      s[mid] = x[2*mid];
      d[mid] = x[2*mid+ROW];

      d[0] = (d[0]<<1)-s[0]-s[1];
      s[0] = s[0]+(d[0]>>2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii<mid; ii++) {
        s_tmp1 = x[2*ii+2];
        d_tmp1 = ((x[2*ii+ROW])<<1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0+((d_tmp0+d_tmp1)>>3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }

      d[mid] = (d[mid]-s[mid])<<1;
      s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

      for (ii=0; ii<=mid; ii++) {
        x[ii] = s[ii];
        x[ii+mid+1] = d[ii];
      }
    }

    for (i=0; i<nt; i++) {
      x = &int_data[i];
      mid = (nt>>1)-1;
      s[0] = x[0];
      d[0] = x[COL];
      s[1] = x[COL<<1];
      s[mid] = x[(COL<<1)*mid];
      d[mid] = x[(COL<<1)*mid+COL];

      d[0] = (d[0]<<1)-s[0]-s[1];
      s[0] = s[0]+(d[0]>>2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii<mid; ii++) {
        s_tmp1 = x[2*COL*(ii+1)];
        d_tmp1 = (x[2*COL*ii+COL]<<1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0+((d_tmp0+d_tmp1)>>3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }

      d[mid] = (d[mid]<<1)-(s[mid]<<1);
      s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

      for (ii=0; ii<=mid; ii++) {
        x[ii*COL] = s[ii];
        x[(ii+mid+1)*COL] = d[ii];
      }
    }
  }
}



5.6.2 Optimizing the Whole Loop Nest

After pre-processing and application of copy propagation over s_tmp1 and d_tmp1, we obtain the
following loop nest.

void forward_wavelet()
{
  int i, nt, *dmid;
  int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
  int mid, ii;
  int *x;
  int s[256], d[256];

  for (nt=64; nt>=16; nt>>=1) {
    for (i=0; i<nt*64; i+=64) {
      x = &int_data[i];
      mid = (nt>>1)-1;
      s[0] = x[0];
      d[0] = x[1];
      s[1] = x[2];
      s[mid] = x[2*mid];
      d[mid] = x[2*mid+1];

      d[0] = (d[0]<<1)-s[0]-s[1];
      s[0] = s[0]+(d[0]>>2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii<mid; ii++) {
        d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
        s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
      }

      d[mid] = (d[mid]-s[mid])<<1;
      s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

      for (ii=0; ii<=mid; ii++) {
        x[ii] = s[ii];
        x[ii+mid+1] = d[ii];
      }
    }

    for (i=0; i<nt; i++) {
      x = &int_data[i];
      mid = (nt>>1)-1;
      s[0] = x[0];
      d[0] = x[64];
      s[1] = x[128];
      s[mid] = x[128*mid];
      d[mid] = x[128*mid+64];

      d[0] = (d[0]<<1)-s[0]-s[1];
      s[0] = s[0]+(d[0]>>2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];

      for (ii=1; ii<mid; ii++) {
        d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
        s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
        d_tmp0 = d[ii];
        s_tmp0 = s[ii];
      }

      d[mid] = (d[mid]<<1)-(s[mid]<<1);
      s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

      for (ii=0; ii<=mid; ii++) {
        x[ii*64] = s[ii];
        x[(ii+mid+1)*64] = d[ii];
      }
    }
  }
}



Then we have 4 tables, one for each innermost loop. The tables for the first and the third loops are 
identical, as are the tables for the second and the fourth loop. We have the following two tables. 
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Parameter 


• Value 


Vector length 


rnid-2 


Reused data set stee 




I/OIRAMs 


<> 


ALU 


6 


BREG 


0 


FREG ~ " 


2 


Data flow graph width ^ 


2 


Data flow graph height ~ 


6 


Configuration cycles 


6+(mid-2) 



Parameter 


Value 


Vector length *~ ~™*™~ ~~— 
Reused data set size 


mid 


I/OIRAMs 


6 


ALU • ~™ 


0 


BREG 


0 


FREG ~~ ~ 


.0 


Data flow graph width 


2 


Data flow graph height ~* 


1 


Configuration cycles 


mid • 



2 .copse * ^ ^ ssst^'ssitsr 



for (nt=64; nt>=16; nt>>=1) {
  for (i=0; i<nt*64; i+=64) {
    x = &int_data[i];
    mid = (nt>>1)-1;
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[mid] = x[2*mid];
    d[mid] = x[2*mid+1];

    d[0] = (d[0]<<1)-s[0]-s[1];
    s[0] = s[0]+(d[0]>>2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii=1; ii<mid; ii++) {
      d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
    }

    for (ii=1; ii<mid; ii++) {
      x[ii] = s[ii];
      x[ii+mid+1] = d[ii];
    }

    d[mid] = (d[mid]-s[mid])<<1;
    s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

    x[0] = s[0];
    x[mid+1] = d[0];
    x[mid] = s[mid];
    x[2*mid+1] = d[mid];
  }

  for (i=0; i<nt; i++) {
    x = &int_data[i];
    mid = (nt>>1)-1;
    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[mid] = x[128*mid];
    d[mid] = x[128*mid+64];

    d[0] = (d[0]<<1)-s[0]-s[1];
    s[0] = s[0]+(d[0]>>2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii=1; ii<mid; ii++) {
      d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
    }

    for (ii=1; ii<mid; ii++) {
      x[ii*64] = s[ii];
      x[(ii+mid+1)*64] = d[ii];
    }

    d[mid] = (d[mid]<<1)-(s[mid]<<1);
    s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

    x[0] = s[0];
    x[(mid+1)*64] = d[0];
    x[mid*64] = s[mid];
    x[(2*mid+1)*64] = d[mid];
  }
}



After loop peeling, the only change in the parameters is the vector length. The tables become:



us 
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Reused data set size 



I/OIRAMs 



ALU 



BREG 



Data flow graph heig ht 
Configuration cycles 




6+(mid-2) 




for (nt=64; nt>=16; nt>>=1) {
  for (i=0; i<nt*64 /*tmp_nt*/; i+=64) {
    x = &int_data[i];
    mid = (nt>>1)-1;
    s[0] = x[0];
    d[0] = x[1];
    s[1] = x[2];
    s[mid] = x[2*mid];
    d[mid] = x[2*mid+1];

    d[0] = (d[0]<<1)-s[0]-s[1];
    s[0] = s[0]+(d[0]>>2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii=1; ii<mid; ii++) {
      d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
      s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii] = s[ii];
      x[ii+mid+1] = d[ii];
    }

    d[mid] = (d[mid]-s[mid])<<1;
    s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

    x[0] = s[0];
    x[mid+1] = d[0];
    x[mid] = s[mid];
    x[2*mid+1] = d[mid];
  }

  for (i=0; i<nt; i++) {
    x = &int_data[i];
    mid = (nt>>1)-1;
    s[0] = x[0];
    d[0] = x[64];
    s[1] = x[128];
    s[mid] = x[128*mid];
    d[mid] = x[128*mid+64];

    d[0] = (d[0]<<1)-s[0]-s[1];
    s[0] = s[0]+(d[0]>>2);
    d_tmp0 = d[0];
    s_tmp0 = s[1];

    for (ii=1; ii<mid; ii++) {
      d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
      s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
      d_tmp0 = d[ii];
      s_tmp0 = s[ii];
      x[ii*64] = s[ii];
      x[(ii+mid+1)*64] = d[ii];
    }

    d[mid] = (d[mid]<<1)-(s[mid]<<1);
    s[mid] = s[mid]+((d[mid-1]+d[mid])>>3);

    x[0] = s[0];
    x[(mid+1)*64] = d[0];
    x[mid*64] = s[mid];
    x[(2*mid+1)*64] = d[mid];
  }
}



After loop fusion we have only two loops, which share the same parameter table.
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Parameter 


Value 


Vector length 


mid-2 


Reused data set size 


m 


I/OlRAMs 


8 


ALU 


6 


BREG 


0 


FREO 


2 


Data flow graph width 


2 


Data flow graph height 


6 


Configuration cycles 


6+<mid-2) 



When performing value range analysis, the compiler finds that nt takes the values 64, 32 and
16. The upper bound of the inner loops is mid, which depends on the value of nt. The analysis then
finds that mid can take the values 31, 15 and 7. Loops with constant loop bounds can be handled more
efficiently on the PACT XPP. This means that the inner loops can be better optimized if mid is
replaced by a constant value. This will happen when the outer loop is unrolled. This way we will
obtain bigger code, but with 3 instances of the loop nest, each being a candidate for a configuration.
This can be seen as a kind of temporal partitioning. Thus the outer loop is completely unrolled, giving
six new loop nests.
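
The values found by the value range analysis can be reproduced by simply enumerating the outer loop. The following C sketch does this; the function name value_ranges is chosen for illustration and is not part of the compiler described here.

```c
/* Collects the values of nt and mid over the outer loop
 * for (nt = 64; nt >= 16; nt >>= 1), with mid = (nt >> 1) - 1.
 * Returns the number of iterations of the outer loop. */
int value_ranges(int nt_vals[], int mid_vals[])
{
    int n = 0;
    for (int nt = 64; nt >= 16; nt >>= 1) {
        nt_vals[n]  = nt;
        mid_vals[n] = (nt >> 1) - 1;
        n++;
    }
    return n;
}
```

The outer loop runs three times, and each of its iterations contains two loop nests (row and column direction), which gives the six loop nests after complete unrolling.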

for (i=0; i<4096; i+=64) { /* nt=64 */
  x = &int_data[i];
  mid = 31;
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<31; ii++) {
    d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+32] = d[ii];
  }

  d[31] = (d[31]-s[31])<<1;
  s[31] = s[31]+((d[30]+d[31])>>3);

  x[0] = s[0];
  x[32] = d[0];
  x[31] = s[31];
  x[63] = d[31];
}

for (i=0; i<64; i++) {
  x = &int_data[i];
  mid = 31;
  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[128];
  s[31] = x[3968];
  d[31] = x[4032];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<31; ii++) {
    d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+32)*64] = d[ii];
  }

  d[31] = (d[31]<<1)-(s[31]<<1);
  s[31] = s[31]+((d[30]+d[31])>>3);

  x[0] = s[0];
  x[2048] = d[0];
  x[1984] = s[31];
  x[4032] = d[31];
}

for (i=0; i<2048; i+=64) { /* nt = 32 */
  x = &int_data[i];
  mid = 15;
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[15] = x[30];
  d[15] = x[31];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<15; ii++) {
    d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+16] = d[ii];
  }

  d[15] = (d[15]-s[15])<<1;
  s[15] = s[15]+((d[14]+d[15])>>3);

  x[0] = s[0];
  x[16] = d[0];
  x[15] = s[15];
  x[31] = d[15];
}

for (i=0; i<32; i++) {
  x = &int_data[i];
  mid = 15;
  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[128];
  s[15] = x[1920];
  d[15] = x[1984];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<15; ii++) {
    d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+16)*64] = d[ii];
  }

  d[15] = (d[15]<<1)-(s[15]<<1);
  s[15] = s[15]+((d[14]+d[15])>>3);

  x[0] = s[0];
  x[1024] = d[0];
  x[960] = s[15];
  x[1984] = d[15];
}

for (i=0; i<1024; i+=64) { /* nt = 16 */
  x = &int_data[i];
  mid = 7;
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[7] = x[14];
  d[7] = x[15];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<7; ii++) {
    d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii] = s[ii];
    x[ii+8] = d[ii];
  }

  d[7] = (d[7]-s[7])<<1;
  s[7] = s[7]+((d[6]+d[7])>>3);

  x[0] = s[0];
  x[8] = d[0];
  x[7] = s[7];
  x[15] = d[7];
}

for (i=0; i<16; i++) {
  x = &int_data[i];
  mid = 7;
  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[128];
  s[7] = x[896];
  d[7] = x[960];

  d[0] = (d[0]<<1)-s[0]-s[1];
  s[0] = s[0]+(d[0]>>2);
  d_tmp0 = d[0];
  s_tmp0 = s[1];

  for (ii=1; ii<7; ii++) {
    d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
    s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
    d_tmp0 = d[ii];
    s_tmp0 = s[ii];
    x[ii*64] = s[ii];
    x[(ii+8)*64] = d[ii];
  }

  d[7] = (d[7]<<1)-(s[7]<<1);
  s[7] = s[7]+((d[6]+d[7])>>3);

  x[0] = s[0];
  x[512] = d[0];
  x[448] = s[7];
  x[960] = d[7];
}



In the parameter table, the vector length is the only value that changes. We give it for the first two
loops. To deduce the table for the other loops, the vector length has to be set to 14 and 6 respectively.
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5.6.3 Optimizing the Inner Loops 



^^?«T2TL" * e Sfa «-« ^ In ft* if we look a, * ,, 

First loop:

for (ii=1; ii<31; ii=ii+2) {
  d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii] = s[ii];
  x[ii+32] = d[ii];
  d[ii+1] = ((x[2*(ii+1)+1])<<1) - s_tmp0 - x[2*(ii+1)+2];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[ii+1] = s[ii+1];
  x[ii+33] = d[ii+1];
}

Second loop:

for (ii=1; ii<31; ii=ii+2) {
  d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii*64] = s[ii];
  x[(ii+32)*64] = d[ii];
  d[ii+1] = (x[128*(ii+1)+64]<<1) - s_tmp0 - x[128*(ii+2)];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[(ii+1)*64] = s[ii+1];
  x[(ii+33)*64] = d[ii+1];
}

Third loop:

for (ii=1; ii<15; ii=ii+2) {
  d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii] = s[ii];
  x[ii+16] = d[ii];
  d[ii+1] = ((x[2*(ii+1)+1])<<1) - s_tmp0 - x[2*(ii+1)+2];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[ii+1] = s[ii+1];
  x[ii+17] = d[ii+1];
}

Fourth loop:

for (ii=1; ii<15; ii=ii+2) {
  d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii*64] = s[ii];
  x[(ii+16)*64] = d[ii];
  d[ii+1] = (x[128*(ii+1)+64]<<1) - s_tmp0 - x[128*(ii+2)];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[(ii+1)*64] = s[ii+1];
  x[(ii+17)*64] = d[ii+1];
}

Fifth loop:

for (ii=1; ii<7; ii=ii+2) {
  d[ii] = ((x[2*ii+1])<<1) - s_tmp0 - x[2*ii+2];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii] = s[ii];
  x[ii+8] = d[ii];
  d[ii+1] = ((x[2*(ii+1)+1])<<1) - s_tmp0 - x[2*(ii+1)+2];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[ii+1] = s[ii+1];
  x[ii+9] = d[ii+1];
}

Sixth loop:

for (ii=1; ii<7; ii=ii+2) {
  d[ii] = (x[128*ii+64]<<1) - s_tmp0 - x[128*(ii+1)];
  s[ii] = s_tmp0+((d_tmp0 + d[ii])>>3);
  d_tmp0 = d[ii];
  s_tmp0 = s[ii];
  x[ii*64] = s[ii];
  x[(ii+8)*64] = d[ii];
  d[ii+1] = (x[128*(ii+1)+64]<<1) - s_tmp0 - x[128*(ii+2)];
  s[ii+1] = s_tmp0+((d_tmp0 + d[ii+1])>>3);
  d_tmp0 = d[ii+1];
  s_tmp0 = s[ii+1];
  x[(ii+1)*64] = s[ii+1];
  x[(ii+9)*64] = d[ii+1];
}

We obtain the following dataflow graph of these loop bodies after a step of tree balancing has been
performed. We represent here only the graph corresponding to the first loop. To obtain the graphs for
the other loops, only the input and output data need to be changed.
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so 





[Figure: dataflow graph of the first loop body, with inputs x(2i+1) and x(2i+2) and merge operations]




Each input and output data stream will occupy an IRAM. d0 and s0 will be the only values in their IRAM,
enabling the merge operations to select between d0, resp. s0, at the first iteration and the feedback
values for the other iterations. Once the pipeline is filled, 8 values can be output in a cycle,
corresponding to 4 values for array x. The same configuration is used for all loops; only the data in the
IRAMs differ. We now give result tables only for the first 2 loops. The other tables are the same.

For the first two loops we obtain the following table, and the expected speedup with respect to a
standard superscalar processor with 2 instructions issued per cycle is 15.3.






Parameter                   Value
Vector length               30
Reused data set size        -
I/O IRAMs                   14
ALU                         12
BREG                        0
FREG                        2
Data flow graph width       2
Data flow graph height      10
Configuration cycles        10+15=25

Ops                                     Number
LD/ST (2 cycles)                        14
ADDRCOMP (1 cycle)                      2
ADD/SUB (1 cycle)                       17
MUL (2 cycles)                          0
SHIFT (1 cycle)                         4
Cycles per iteration                    51
Cycles needed for the loop (2-way)      (51*15)/2 = 383
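
The figures in the two tables can be checked with a small C sketch. It assumes that ADDRCOMP costs one cycle (the only value consistent with 51 cycles per iteration) and that the division by the issue width is rounded up. The struct and function names are illustrative only.

```c
/* Operation counts and per-operation cycle costs taken from the table.
 * The ADDRCOMP cost of 1 cycle is an assumption (see the text above). */
struct op_cost { const char *name; int count; int cycles; };

static const struct op_cost ops[] = {
    { "LD/ST",    14, 2 },
    { "ADDRCOMP",  2, 1 },
    { "ADD/SUB",  17, 1 },
    { "MUL",       0, 2 },
    { "SHIFT",     4, 1 },
};

/* Sum of count x cycles over all operation classes of one iteration. */
int cycles_per_iteration(void)
{
    int total = 0;
    for (unsigned k = 0; k < sizeof ops / sizeof ops[0]; k++)
        total += ops[k].count * ops[k].cycles;
    return total;
}

/* Cycles of the whole loop on a RISC with the given issue width,
 * rounded up to whole cycles. */
int risc_loop_cycles(int iterations, int issue_width)
{
    return (cycles_per_iteration() * iterations + issue_width - 1) / issue_width;
}

/* Ratio of RISC cycles to XPP configuration cycles. */
double speedup(int risc_cycles, int xpp_cycles)
{
    return (double)risc_cycles / xpp_cycles;
}
```

With 15 iterations of the unrolled loop and an issue width of two, risc_loop_cycles(15, 2) gives 383 cycles, and dividing by the 25 configuration cycles of the XPP gives a ratio of about 15.3, in line with the speedup quoted in the text.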
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1 Introduction . 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.
This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations. However, these extensions are not the subject of this document.

2 Compilation Flow 

This section briefly describes the phases of the compilation method. 
2.1 Frontend 

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The frontend also performs
well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. The SUIF compiler
[2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation

Next, the program is mapped to a control/dataflow graph (CDFG) consisting of connected RDFP
functions. This phase is the main subject of this document and is presented in Section 4.

2.3 Configuration Code Generation

Finally, the last phase directly translates the CDFG to configuration code used to program the RDFP. For
PACT XPP™ Cores, the configuration code is generated as an NML (Native Mapping Language) file.

3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of an RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
an RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2. Event packets are called 1-events or 0-events, depending on their bit-value.
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3.1 Configurable Objects and Functions 



An RDFP consists of an array of configurable objects and a communication network. Each object can 
be configured to perform certain functions (listed below). It performs the same function repeatedly until 
the configuration is changed. The array needs not be completely uniform, i. e. not all objects need to be 
able to perform all functions. E. g., a RAM function can be implemented by a specialized RAM object 
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to 
realize certain functions. Several RAM objects can, e. g. , be combined to realize a RAM function with 
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Figure 1 : Functions of an RDFP 

The following functions for processing data and event packets can be configured into an RDFP. See Fig. 1
for a graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL. 1 ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

1 Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.
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o CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound 
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i.e. a NEXT event packet was received after the last data
packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B discarded. For a 1-event, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-event, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function multiplicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For all subsequent 0-events at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only if another 1-event arrives at SEL, the next data
packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)
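The difference between MUX, MERGE, DEMUX and MDATA lies in which input packets are consumed. The following is a minimal Python sketch of these rules, assuming each input and output is modelled as a queue of packets. The function names and the queue representation are illustrative only and are not part of the RDFP definition.

```python
from collections import deque

def mux(a, b, sel, x):
    """MUX: consume SEL and both data inputs; copy the selected packet, discard the other."""
    s, pa, pb = sel.popleft(), a.popleft(), b.popleft()
    x.append(pb if s else pa)

def merge(a, b, sel, x):
    """MERGE: consume SEL and only the selected input; the other packet stays queued."""
    s = sel.popleft()
    x.append(b.popleft() if s else a.popleft())

def demux(a, sel, x, y):
    """DEMUX: route the packet at A to X (0-event) or to Y (1-event)."""
    s, pa = sel.popleft(), a.popleft()
    (y if s else x).append(pa)

def mdata(a, sel, x):
    """MDATA: a 1-event loads the next packet from A, a 0-event repeats the held one."""
    held = None
    while sel:
        if sel.popleft():
            held = a.popleft()
        x.append(held)
```

With the same inputs, `mux` empties input B while `merge` leaves the packet at B in place, which is the property the loop schemes in Section 4.2 rely on.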

² Note that this can be implemented by a MERGE with special properties on XPP™.

Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their values.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3, ..., producing a packet at output U. The output
is a 1-event if and only if one or more of the input packets are 1-events (logical or). A packet must
be available at all inputs before an output packet is produced.³

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.

Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their equivalents in classical 
dataflow machines [3, 4]. 
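The event functions can be illustrated as transformations of event streams. The sketch below is a simplified Python model, assuming a stream is a list of 0/1 values in arrival order. The function names are hypothetical, and the unconnected (endlessly repeating) ESEQ case is left out.

```python
def filter_events(events, keep):
    """0-FILTER (keep=0) or 1-FILTER (keep=1): pass matching events, drop the rest."""
    return [e for e in events if e == keep]

def inverter(events):
    """INVERTER: one output event per input event, value inverted."""
    return [1 - e for e in events]

def constant(events, value):
    """0-CONSTANT / 1-CONSTANT: one output event per input event, value forced."""
    return [value for _ in events]

def ecomb(*inputs):
    """ECOMB: waits for one packet on every input, then outputs their logical OR."""
    return [int(any(group)) for group in zip(*inputs)]

def eseq(seq, starts):
    """ESEQ[seq] with a connected START input: one full sequence per START event."""
    pattern = [int(c) for c in seq]
    return [e for _ in starts for e in pattern]
```

Because `zip` stops at the shortest input, `ecomb` produces no output for a packet that is still waiting for its partner, which mirrors the requirement that all inputs must be available.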



3.2 Packet-based Communication Network 

The communication network of an RDFP can connect an output of one object (i.e. its respective function)
to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions as in a
dataflow machine with acknowledge [3]. I.e., a function only executes when all input packets are available
(apart from the non-strict exceptions as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection"), the self-synchronization is disabled.⁴ The
user has to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might
get lost, and the value resulting from combining two or more packets is undefined. However, a function
output can be connected to many function inputs ("1 to N connection") without problems.
There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet is consumed
like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution.
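The firing rule and the two special cases can be sketched as follows. This is a minimal Python model, assuming one packet register per input and per output. The class names `Port` and `Function` are hypothetical and do not correspond to any real RDFP interface.

```python
class Port:
    """One-packet input register; a 'constant' port is never emptied."""
    def __init__(self, preload=None, constant=False):
        self.packet = preload        # preloaded packets behave like received ones
        self.constant = constant

    def full(self):
        return self.packet is not None

    def take(self):
        value = self.packet
        if not self.constant:        # constant inputs reproduce their packet
            self.packet = None
        return value

class Function:
    def __init__(self, op, inputs):
        self.op, self.inputs, self.out = op, inputs, None

    def try_fire(self):
        """Fire only if every input holds a packet and the last output was consumed."""
        if self.out is not None or not all(p.full() for p in self.inputs):
            return False
        self.out = self.op(*(p.take() for p in self.inputs))
        return True
```

A downstream consumer would set `out` back to `None` when it takes the packet, which plays the role of the acknowledge.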

³ Note that this function is implemented by the EAND operator on the XPP™.

⁴ Note that on XPP™ cores, an "N to 1 connection" for events is realized by the EOR function, and for data by just assigning
several outputs to an input.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and asynchronous
feedback are possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.

4 Configuration Generation 
4.1 Language Definition 

The following HLL features are not supported by the method described here:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

• INPORTs and OUTPORTs can be accessed by the HLL functions getstream(name, value) and
putstream(name, value), respectively.

4.2 Mapping of High-Level Language Constructs

This method converts an HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.
Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions. An array x
is mapped to RAM RAM(x). If several arrays are mapped to the same RAM, an offset is assigned, too.
The RAMs are added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program state- 
ment by statement and descends into the loops and conditional statements as appropriate. The following 
two pieces of information are updated at every program point⁵ during the traversal:

• START points to an event output of an RDFP function. This output delivers a 0-event whenever
the program execution reaches this program point. At the beginning, a 0-CONSTANT with a preloaded
event input is added to the CDFG. (It delivers a 0-event immediately after configuration.)
START initially points to its output. This event is used to start the overall program execution. The
START_new signal generated after a program part has finished executing is used as new START
signal for the following program parts, or it signals termination of the entire program. The START
events guarantee that the execution order of the original program is maintained wherever the data
dependencies alone are not sufficient. This scheduling scheme is similar to a one-hot controller
for digital hardware.

• VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

⁵ In a program, program points are between two statements or before the beginning or after the end of a program component
like a loop or a conditional statement.

The following subsections systematically list all HLL program components and describe how they are 
processed, thereby altering the CDFG, START and VARLIST. 
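VARLIST and VARDEF behave like an association list that is searched from the front. A minimal Python sketch, assuming pairs are stored as tuples and function outputs are represented by arbitrary labels; the helper names `define` and `vardef` are illustrative.

```python
def define(varlist, var, output):
    """New pairs always go to the front; older pairs stay but are shadowed."""
    return [(var, output)] + varlist

def vardef(varlist, var):
    """VARDEF(var): function-output of the first pair for 'var' in VARLIST."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)
```

Keeping the shadowed pairs makes it cheap to copy the list for the two branches of a conditional and to compare the copies afterwards.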

4.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.
The simplest statement is a constant assigned to an integer:⁷

    a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a "pseudo-object"
which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

    b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created.⁸ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

⁶ This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].

⁷ Note that we use C syntax for the following examples.
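The expression-tree mapping can be sketched in a few lines. The following Python fragment is illustrative only: it assumes the RHS is given as nested tuples `(operator, left, right)`, represents the CDFG as a list of dictionaries, and invents output labels such as `ALU0.X`, whereas the text labels outputs with Roman numbers.

```python
def vardef(varlist, var):
    """First pair for 'var' in VARLIST holds the currently valid definition."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)

def map_expr(node, varlist, cdfg):
    """Return a constant (pseudo-object) or the label of an ALU output for 'node'."""
    if isinstance(node, int):
        return node                      # constant leaf: connected directly
    if isinstance(node, str):
        return vardef(varlist, node)     # variable leaf: look up VARDEF(var)
    op, left, right = node
    a = map_expr(left, varlist, cdfg)
    b = map_expr(right, varlist, cdfg)
    out = f"ALU{len(cdfg)}.X"            # hypothetical label for the new ALU's output
    cdfg.append({"opcode": op, "A": a, "B": b, "X": out})
    return out

def assign(lhs, rhs, varlist, cdfg):
    """Process 'lhs = rhs;': build ALUs for the RHS, then add {LHS, result(RHS)}."""
    result = map_expr(rhs, varlist, cdfg)
    varlist.insert(0, (lhs, result))
```

Calling `assign("a", 5, ...)` followed by `assign("b", ("+", ("*", "a", 2), 3), ...)` on empty lists leaves two ALU entries in the CDFG, one for the multiplication and one for the addition, and puts the pair for b in front of the pair for a.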

4.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition evaluation
are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a functionally
correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.⁹

Consider the following example:

    i = 7;
    a = 3;

    if (i < 10) {
        a = 5;
        c = 7;
    }
    else {
        c = a - 1;
        d = 0;
    }

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c, III},
{a, IV}, {a, 3}, {i, 7}].
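The combination phase can be sketched as follows. This Python fragment is a simplified illustration under two assumptions that are not stated in the text: the comparator's 1-event (condition true) selects the then-branch value, so that value is wired to MUX input B, and variables are combined in the order in which the branches defined them. The labels `MUX0.X` etc. are hypothetical.

```python
def vardef(varlist, var, default="undefined"):
    """First pair for 'var', or the special 'undefined' constant if there is none."""
    for name, output in varlist:
        if name == var:
            return output
    return default

def combine(varlist, varlist1, varlist2, cond, cdfg):
    """Add one MUX per variable defined in either branch; return the new VARLIST."""
    n = len(varlist)
    new1 = varlist1[:len(varlist1) - n]    # pairs added by the then-branch
    new2 = varlist2[:len(varlist2) - n]    # pairs added by the else-branch
    order = []
    for name, _ in list(reversed(new1)) + list(reversed(new2)):
        if name not in order:
            order.append(name)
    result = list(varlist)
    for name in order:
        out = f"MUX{len(cdfg)}.X"
        # SEL = 0 copies input A (else value), SEL = 1 copies input B (then value).
        cdfg.append({"func": "MUX", "SEL": cond, "X": out,
                     "A": vardef(varlist2, name), "B": vardef(varlist1, name)})
        result.insert(0, (name, out))
    return result
```

Since VARLIST1 and VARLIST2 are copies that still end with the original VARLIST, looking up a variable defined in only one branch automatically falls back to its original definition for the other branch.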

Note that case- or switch-statements can be processed, too, since they can - without loss of generality - 
be converted to nested if-then-else statements. 

Processing conditional statements this way does not require explicit control and does not change START. 
Both branches are executed in parallel and synchronized by the data-flow. It is possible to pipeline the 
dataflow for optimal throughput.

⁸ Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.

⁹ Definition: A variable is live at a program point if its value is read at a statement reachable from here without intermediate
redefinition.

4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.¹⁰
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The DEMUX
functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then-branch is processed with
VARLIST1, and the else branch with VARLIST2. Finally, the output values are combined. OUT contains
the new values for the same variables as in IN. Since only one branch is ever activated, there will not
be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the output
must be inserted, too. They are controlled by the condition like the DEMUX functions.
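The set IN and the DEMUX insertion can be sketched as below. This Python fragment is illustrative: the use/def sets are assumed to be given as Python sets, the labels `DEMUX0.X`/`DEMUX0.Y` are hypothetical, and the wiring of output Y to the then-branch assumes that a true condition is signalled by a 1-event.

```python
def in_set(use_then, def_then, use_else, def_else, use_header):
    """IN = use(thenbody) ∪ def(thenbody) ∪ use(elsebody) ∪ def(elsebody) ∪ use(header)."""
    return set(use_then) | set(def_then) | set(use_else) | set(def_else) | set(use_header)

def current_entry(varlist, var):
    for name, output in varlist:
        if name == var:
            return output
    return None

def split_for_branches(varlist, in_vars, cond, cdfg):
    """Insert one DEMUX per variable in IN that has a current VARLIST entry."""
    varlist1, varlist2 = list(varlist), list(varlist)
    for var in sorted(in_vars):
        source = current_entry(varlist, var)
        if source is None:               # no current definition: nothing to forward
            continue
        name = f"DEMUX{len(cdfg)}"
        cdfg.append({"func": "DEMUX", "A": source, "SEL": cond})
        varlist1.insert(0, (var, name + ".Y"))   # 1-event: packet goes to thenbody
        varlist2.insert(0, (var, name + ".X"))   # 0-event: packet goes to elsebody
    return varlist1, varlist2
```

For a statement such as `if (x > 0) y = a[x]; else y = 0;` the set IN is {x, y}, but only x receives a DEMUX when y has no definition before the conditional.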

The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the execution
as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT¹¹ or
through a 0-FILTER, respectively. The overall START_new output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's START_new outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.
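The START routing for the two branches can be traced on event streams. A minimal Python sketch, assuming each stream is a list of 0/1 events and that the k-th START event is paired with the k-th condition event; the function name is hypothetical.

```python
def branch_starts(start_events, cond_events):
    """Derive the DEMUX SEL stream and the START streams of thenbody and elsebody."""
    # ECOMB: logical OR, one output per pair of input packets
    sel = [s | c for s, c in zip(start_events, cond_events)]
    # thenbody: 1-FILTER followed by 0-CONSTANT (START events must be 0-events)
    then_start = [0 for e in sel if e == 1]
    # elsebody: 0-FILTER passes the 0-events unchanged
    else_start = [e for e in sel if e == 0]
    return sel, then_start, else_start
```

Since START events are always 0-events, the OR leaves the condition value unchanged; the ECOMB only delays the condition until the program point has actually been reached.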

4.2.4 WHILE Loops 

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3, double
line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the while body are fed back to whilebody's input via the MERGE and DEMUX
operators as long as the condition is true. Finally, after the last iteration, they are forwarded to OUT. The
outputs are added to the new VARLIST.¹²

Two extensions with respect to [4] are added (dotted lines in Fig. 9):

• In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This allows controlling the time
of the start of the loop execution and restarting it.

• The whilebody's START input is connected to the header output, sent through a 1-FILTER/0-CONSTANT
combination as above (generates a 0-event for each loop iteration). By ECOMB-combining
whilebody's START_new output with the header output for the MERGE functions'
SEL inputs, the next loop iteration is only started after the previous one has finished. The while
loop's START_new output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within whilebody.

¹⁰ A variable is used in a statement (and hence in a program region containing this statement) if its value is read. A variable
v is defined in a statement (or region) if a new value is assigned to it.

¹¹ The 0-CONSTANT is required since START events must always be 0-events.

¹² Note that the MERGE function for variables not live at the loop's beginning and the whilebody's beginning can be removed
since its output is not used. For these variables, only the DEMUX function to output the final value is required. Also note that
the MERGE functions can be replaced by simple "2 to 1 connections" if the configuration process guarantees that packets from
IN1 always arrive at the DEMUX's input before feedback values arrive.
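The circulation of values through the MERGE/DEMUX loop can be mimicked in software. The following Python sketch is only an analogy of the scheme, not a packet-accurate model: the variables in IN are kept in a dictionary, and `header` and `body` are ordinary callables standing in for the header and whilebody subgraphs.

```python
def run_while(entry_values, header, body):
    """Trace the values of the variables in IN through a WHILE loop.

    Returns the values forwarded to OUT and the number of body executions.
    """
    sel = 0                                   # first iteration: MERGE takes loop-entry values
    entry, feedback = dict(entry_values), None
    iterations = 0
    while True:
        merged = entry if sel == 0 else feedback   # MERGE functions
        cond = 1 if header(merged) else 0          # header produces the condition event
        if cond == 0:
            return merged, iterations              # DEMUX forwards the values to OUT
        feedback = body(dict(merged))              # DEMUX forwards the values to whilebody
        sel = cond                                 # header output selects the feedback input
        iterations += 1
```

For example, the loop `while (i < 3) { s = s + i; i = i + 1; }` corresponds to `run_while({"s": 0, "i": 0}, lambda v: v["i"] < 3, lambda v: {"s": v["s"] + v["i"], "i": v["i"] + 1})`.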



4.2.5 FOR Loops

FOR loops are particularly regular WHILE loops. Therefore we could handle them as explained above.
However, our RDFP features the special counter function CNT and the data packet multiplication function
MDATA which can be used for a more efficient implementation of FOR loops. This new FOR loop
scheme is shown in Fig. 10.

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any other expressions (see Sections 4.2.1 and 4.2.7) and connected
to the respective inputs.

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables in IN1 =
def(forbody), i.e. those defined in forbody.¹³ IN1 does not contain variables which are only used
in forbody, LB, UB, or INC, and also does not contain the loop index variable. Variables in IN1 are
processed as in WHILE loops, but the MERGE and DEMUX functions' SEL input is connected to
CNT's W output. (The W output does the inverse of a WHILE loop's header output; it outputs a
1-event after the counter has terminated. Therefore the inputs of the MERGE functions and the outputs
of the DEMUX functions are swapped here, and the MERGE functions' SEL inputs are preloaded with
1-events.)

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, it is selected with a DEMUX function controlled by CNT's U event output
(which produces one event for every loop iteration).

Variables in IN2 = use(forbody) \ def(forbody), i.e. those defined outside the loop and only used
(but not redefined) inside the loop, are handled differently. Unless it is a constant value, the variable's
input value (from VARLIST) must be reproduced in each loop iteration since it is consumed in each
iteration. Otherwise the loop would stall from the second iteration onwards. The packets are reproduced
by MDATA functions, with the SEL inputs connected to CNT's U output. The SEL inputs must be
preloaded with a 1-event to select the first input. The 1-event provided by the last iteration selects a new
value for the next execution of the entire loop.

¹³ Note that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the configuration
process guarantees that packets from IN1 always arrive at the DEMUX's input before feedback values arrive.
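The classification of loop variables into IN1 and IN2 can be sketched as a small set computation. In this Python fragment the use and def sets of forbody are assumed to be given; excluding the loop index from IN2 as well as from IN1 is an assumption made here because the index is supplied by CNT's X output.

```python
def for_loop_sets(use_body, def_body, index_var):
    """Return (IN1, IN2) for a FOR loop body.

    IN1: variables defined in forbody, routed through a MERGE/DEMUX loop.
    IN2: variables only used in forbody, reproduced by MDATA functions.
    """
    in1 = set(def_body) - {index_var}
    in2 = set(use_body) - set(def_body) - {index_var}
    return in1, in2
```

For `for (i = 0; i < n; i++) { s = s + k * i; }` the body uses {s, k, i} and defines {s}, so s belongs to IN1 and k to IN2.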
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• extensions, but 
START_new is generated from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output
produces one 0-event for each loop iteration and is therefore used as forbody's START. Finally, CNT's
NEXT input is connected to forbody's START_new output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
START_new output since the counter terminates before the last iteration's forbody finishes. Instead,
START_new is generated from CNT's U output ECOMB-combined with forbody's START_new output,
sent through a 1-FILTER/0-CONSTANT combination. The ECOMB produces an event after termination
of each loop iteration, but only the last event is a 1-event because only the last output of CNT's U output
is a 1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR loop
example compilation with and without pipelining.
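The event streams of CNT and the derivation of START_new for a pipelined loop can be checked with a short Python sketch. It assumes a positive increment, an exclusive upper bound and at least one iteration; the text does not specify these details, and the function names are hypothetical.

```python
def cnt_events(lb, ub, inc):
    """Index values and event streams of one complete run of the counter."""
    values = list(range(lb, ub, inc))     # assumption: UB exclusive, INC > 0
    n = len(values)
    if n == 0:
        raise ValueError("sketch covers counters with at least one iteration")
    u = [0] * (n - 1) + [1]               # N-1 0-events and one 1-event
    v = [0] * n                           # N 0-events (forbody's START)
    w = [0] * n + [1]                     # N 0-events, then a 1-event after termination
    return values, u, v, w

def pipelined_start_new(u, body_done):
    """START_new of a pipelined FOR loop from CNT's U and forbody's START_new events."""
    combined = [a | b for a, b in zip(u, body_done)]   # ECOMB (logical OR)
    return [0 for e in combined if e == 1]             # 1-FILTER, then 0-CONSTANT
```

Because the ECOMB waits for forbody's START_new event of the same iteration, the single resulting 0-event cannot appear before the last iteration has finished.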

As for WHILE loops, these methods allow processing arbitrarily nested loops and conditional statements.
The following advantages over WHILE loop implementations are achieved:

• One index variable value is generated by the CNT function each clock cycle. This is faster and
smaller than the WHILE loop implementation which allocates a MERGE/DEMUX/ADD loop and
a comparator for the counter functionality.

• Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions and need
not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE loop
implementation.

4.2.6 Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unduly sequentialized by the START signals. In some cases, innermost
loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined dataflow
through the operators of the loop body. The Pipeline Vectorization technique [6] can be easily applied to
the compilation method presented here. As mentioned above, for FOR loops, the CNT's NEXT input is
removed so that CNT counts continuously, thereby overlapping the loop iterations.

All loops without array accesses can be pipelined since the dataflow automatically synchronizes
loop-carried dependences, i.e. dependences between a statement in one iteration and another statement in a
subsequent iteration. Loops with array accesses can be pipelined if the array (i.e. RAM) accesses do
not cause loop-carried dependences or can be transformed to such a form. In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom is exploited in the RAM access technique described below.
Especially for dual-ported RAM it leads to considerable performance improvements.

4.2.7 Array Accesses

In contrast to scalar variables, array accesses have to be controlled explicitly in order to maintain the
program's correct execution order. As opposed to normal dataflow machine models [3], an RDFP does
not have a single address space. Instead, the arrays are allocated to several RAMs. This leads to a
different approach to handling RAM accesses and opens up new opportunities for optimization.

To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase
1 uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both functions are labeled with the array the access refers to, and
both have a START event input and a U event output. The events control the access order. In Phase 2 all
accesses to the same RAM are combined and substituted by a single RAM function as shown in Fig. 1.
This involves manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1. Since arrays are allocated to several RAMs, only accesses to the same RAM have to be synchronized.
Accesses to different RAMs can occur concurrently or even out of order. In case of data
dependencies, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. This is achieved by maintaining separate
START signals for every RAM or even separate START signals for RAM read and RAM write accesses
in pipelined loops. At the end of a basic block [1],¹⁴ all START_new outputs must be combined by an
ECOMB to provide a START signal for the next basic block which guarantees that all array accesses in
the previous basic block are completed. For pipelined loops, this condition can even be relaxed. Only
after the loop exit all accesses have to be completed. The individual loop iterations need not be synchronized.

First the RAM addresses are computed. The compiler frontend's standard transformation for array accesses
can be used, and a CDFG function's output is generated which provides the address. If applicable,
the offset with respect to the RDFP RAM (as determined in the initial mapping phase) must be added.
This output is connected to the pseudo RAM read's RD input (for a read access) or to the pseudo RAM
write's WR input (for a write access). Additionally, the OUT output (read) or IN input (write) is connected.
The START input is connected to the variable's START signal, and the U output is used as
START_new for the next access.
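Phase 1 amounts to chaining the pseudo-functions of each RAM through their START inputs and U outputs. A minimal Python sketch, assuming the CDFG is a list of dictionaries and that one START signal is tracked per RAM in a dictionary; all names and labels are hypothetical.

```python
def add_access(cdfg, start_of_ram, ram, kind, address, offset=0, value=None):
    """Append a pseudo RAM read or write and chain it behind the previous access to 'ram'.

    kind is "read" or "write"; 'address' is the label of the function output that
    provides the address, 'offset' the array's offset inside the RAM.
    """
    name = f"{kind}{len(cdfg)}"
    node = {"func": "pseudo_" + kind, "ram": ram,
            "addr": (address, offset),
            "START": start_of_ram[ram]}       # current START signal of this RAM
    if kind == "write":
        node["IN"] = value                    # value to be written
    cdfg.append(node)
    start_of_ram[ram] = name + ".U"           # U output is START_new for the next access
    return name + ".OUT" if kind == "read" else None
```

Two consecutive reads from the same array thus form a chain START(old) → first read → second read, while accesses to other RAMs keep their own, independent chains.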

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer 
variable, an array element is used as first element of the pair. However, a change in a variable occurring 
in an array index invalidates the information in VARLIST. It must then be removed from it.

The following example with two read accesses compiles to the intermediate CDFG shown in Fig. 12. The
START signals refer only to variable a. STOP1 is the event connection which synchronizes the accesses.
Inputs START(old), i and j should be substituted by the actual outputs resulting from the program before
the array reads.

    x = a[i];
    y = a[j];
    z = x + y;

Fig. 13 shows the translation of the following write access:

    a[i] = x;

¹⁴ A basic block is a program part with a single entry and a single exit point, i.e. a piece of straight-line code.

Phase 2. We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access), GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x; from Fig. 13 is transformed to the final CDFG shown in Fig. 5.

However, if several read or several write accesses (i.e. pseudo-functions from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a START_new event of
the original pseudo-function should only be generated for the respective program point, i.e. for the current
access. This is achieved by connecting the START signals of all other accesses (pseudo-functions)
of the same type (read or write) with the inverted START signal of the current access. The resulting
signal produces an event for every access, but only for the current access a 1-event. This event is
ECOMB-combined with the RAM's ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access produces
a 1-event. Next, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT, resulting
in a START_new signal which produces a 0-event only after the current access is completed as required.
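The event combination for one of several accesses to the same RAM can be traced as follows. This Python sketch assumes that the order in which the accesses occur is given as a list of labels and that exactly one START event arrives per access; the function name and the labels are illustrative.

```python
def start_new_for(current, access_order):
    """Events seen by the access 'current' when several accesses share one RAM.

    Returns the ECOMB output stream and the resulting START_new stream.
    """
    # START events are 0-events; only the current access's START is inverted to a 1-event
    select = [1 if acc == current else 0 for acc in access_order]
    erd = [0] * len(access_order)                       # ERD/EWR always yield 0-events
    combined = [s | e for s, e in zip(select, erd)]     # ECOMB: one event per completed access
    start_new = [0 for e in combined if e == 1]         # 1-FILTER, then 0-CONSTANT
    return combined, start_new
```

For the access order r0, r1, r0, r1 the access r1 sees the ECOMB stream 0, 1, 0, 1 and therefore produces two START_new events, one for each of its own accesses.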

For several accesses, several sources are connected to the RD, WR and IN inputs of a RAM. This disables 
the self-synchronization. However, since only one access occurs at a time, the GATEs only allow one 
data packet to axxive at the inputs. 

For read accesses, the packets at the OUT output face the same problem as the ERD event packets:
They occur for every read access, but must only be used (and forwarded to subsequent operators) for
the current access. This can be achieved by connecting the OUT output via a DEMUX function. The Y
output of the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which
only forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a
0-event. The signal created by the ECOMB described above for the START(new) signal creates a 1-event
for the current access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired
functionality. 
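
The selective-gate behaviour of the DEMUX can be sketched in C as follows. This is an illustration under stated assumptions only: the function name demux_select and its array-based interface are hypothetical, and the stream of OUT packets and SEL events is modelled as two arrays of equal length.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Models a DEMUX whose Y output is used and whose X output is unconnected.
 * out_packets[i] is the i-th packet at the RAM's OUT output and sel[i] the
 * matching SEL event (1 = forward, 0 = discard). Forwarded packets are
 * written to forwarded[]; the return value is their count.
 */
size_t demux_select(const int *out_packets, const int *sel, size_t n,
                    int *forwarded)
{
    assert(out_packets != NULL && sel != NULL && forwarded != NULL);

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (sel[i] == 1) {
            forwarded[kept++] = out_packets[i];  /* 1-event: packet goes to Y */
        }
        /* 0-event: packet would go to the unconnected X output and is lost */
    }
    return kept;
}
```

For a read access that is current for the first and fourth packet only, the SEL stream 1, 0, 0, 1 forwards exactly those two packets to the subsequent operators.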

Fig. 4 shows the resulting CDFG for the first example above (two read accesses), after applying the
transformations of Phase 2 to Fig. 12. STOP1 is now generated as follows: START(old) is inverted,
"2 to 1 connected" to STOP1 (because it is the START input of the second read pseudo-function),
ECOMB-combined with RAM's ERD output and sent through the 1-FILTER/0-CONSTANT combination.
START(new) is generated similarly, but here START(old) is directly used and STOP1 inverted. The
GATEs for input RD (i and j) are connected to START(old) and STOP1, respectively, and the DEMUX
functions for outputs x and y are connected to the ECOMB outputs related to STOP1 and START(new).

Multiple write accesses use the same control events, but instead of one GATE per access for the RD 
inputs, one GATE for WR and one gate for IN (with the same E input) are used. The EWR output is 
processed like the ERD output for read accesses. 

This transformation ensures that all RAM accesses are executed correctly, but it is not very fast since read 
or write accesses to the same RAM are not pipelined. The next access only starts after the previous one 
is completed, even if the RAM being used has several pipeline stages. This inefficiency can be removed 
as follows:

First, continuous sequences of either read accesses or write accesses (not mixed) within a basic block are
detected by checking for pseudo-functions whose U output is directly connected to the START input of
another pseudo-function of the same RAM and the same type (read or write). For these sequences, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are
only forwarded in the desired order. Then only the first access in the sequence needs to be controlled by
a GATE or GATEs. Similarly, the OUT outputs of a read access can be distributed more efficiently for
a sequence. A combination of DEMUX functions with the same ESEQ control can be used. It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.
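
The benefit of the balanced arrangement can be made concrete with a short C sketch. It assumes two-way MERGE and DEMUX functions, and the function names merge_tree_depth and merge_chain_depth are hypothetical. Both arrangements need n - 1 functions for n accesses; what differs is the number of functions a packet passes on its longest path.

```c
#include <assert.h>

/* Longest path (in functions) through a balanced binary tree with n leaves,
   i.e. the ceiling of log2(n). Assumes two-way MERGE/DEMUX functions. */
unsigned merge_tree_depth(unsigned n)
{
    assert(n >= 1);
    unsigned depth = 0;
    for (unsigned m = n - 1; m > 0; m >>= 1) {
        depth++;
    }
    return depth;
}

/* Longest path through a linear chain of two-way functions with n leaves. */
unsigned merge_chain_depth(unsigned n)
{
    assert(n >= 1);
    return n - 1;
}
```

For a sequence of eight accesses, the balanced tree has a longest path of three functions, while a linear chain has a longest path of seven.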

The START(new) signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly "N to 1 connected" with the other accesses' START signal
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences), ECOMB-connected
to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination
similar to the basic method described above. Since only the last ESEQ output is a 1-event, only the
last RAM access generates a START(new) as required. Alternatively, for read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing a
START(new) event.
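
The ESEQ-based generation for a sequence of length n can be sketched in C. The sketch is illustrative only: eseq_pattern and start_new_count are hypothetical names, events are modelled as integers, and the ERD or EWR completion event of each access is assumed to be a 0-event.

```c
#include <assert.h>
#include <stddef.h>

/* Fills pattern[0..n-1] with the event sequence of an ESEQ[00..1]:
   n - 1 0-events followed by a single 1-event. */
void eseq_pattern(int *pattern, size_t n)
{
    assert(pattern != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        pattern[i] = (i == n - 1) ? 1 : 0;
    }
}

/* Counts the START(new) events produced when each ESEQ event is
   ECOMB-combined with a completion event (assumed 0-event) and then sent
   through the 1-FILTER/0-CONSTANT combination. */
size_t start_new_count(const int *pattern, size_t n)
{
    assert(pattern != NULL);
    size_t fired = 0;
    for (size_t i = 0; i < n; i++) {
        int combined = pattern[i] | 0;  /* OR with the completion event */
        if (combined == 1) {            /* 1-FILTER passes 1-events only */
            fired++;
        }
    }
    return fired;
}
```

For n = 4 the pattern is 0, 0, 0, 1, so exactly one START(new) event results, and it belongs to the last RAM access of the sequence.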

Fig. 14 shows the optimized version of the first example (Figures 12 and 4) using the ESEQ-method for
generating START(new), and Fig. 6 shows the final CDFG of the following, larger example with three
array reads. Here the latter method for producing the START(new) event is used.

x = a[i];
y = a[j];
z = a[k];

If several read sequences or read sequences and single read accesses occur for the same RAM, 1-events
for detecting the current accesses must be generated for sequences of read accesses. They are needed
to separate the OUT values relating to separate sequences. The ESEQ output just defined, sent through
a 1-CONSTANT, achieves this. It is again "N to 1 connected" to the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences). The resulting
event is used to control a first-stage DEMUX which is inserted to select the relevant OUT output data
packets of the sequence as described above for the basic method. Refer to the second example (Figures
15 and 16) in Section 4.3 for a complete example.



Input and output ports are processed similarly to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.
4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop.

a = b + c;

for (i=0; i<=10; i++) {
    a = a + i;
    x[i] = k;
}

In this example, IN1 = {a} and IN2 = {k} (cf. Fig. 10). The MERGE function for variable a is
replaced by a 2:1 data connection as mentioned in the footnote of Section 4.2.5. Note that only one
data packet arrives for variables b, c and k, and one final packet is produced for a (out). forbody does
not use a START event since both operations (the adder and the RAM write) are dataflow-controlled
by the counter anyway. But the RAM's EWR output is the forbody's START(new) and connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.6, was not applied here. If it
is applied (which is possible for this loop), CNT's NEXT input is not connected, cf. Fig. 11. Here, the
loop iterations overlap. START(new) is generated from CNT's U output and forbody's START(new) (i.e.
RAM's EWR output), as defined at the end of Section 4.2.5.
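
For checking the CDFG of Fig. 7 against the source semantics, a sequential reference version of the loop may be helpful. The following C sketch is such a reference only and says nothing about the generated dataflow graph; the function name for_loop_reference, the constant LOOP_COUNT and the use of int for all variables are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define LOOP_COUNT 11  /* i runs from 0 to 10 inclusive */

/* Sequential reference for the loop: returns the final value of a and
   fills x[0..10] with k. The types are assumed to be int. */
int for_loop_reference(int b, int c, int k, int x[LOOP_COUNT])
{
    assert(x != NULL);

    int a = b + c;
    for (int i = 0; i < LOOP_COUNT; i++) {
        a = a + i;  /* the adder, fed by the counter */
        x[i] = k;   /* the RAM write; k arrives as a single packet */
    }
    return a;       /* the one final packet produced for a (out) */
}
```

With b = 1, c = 2 and k = 7, the final value of a is 58 (3 plus the sum of 0 to 10), and all eleven elements of x hold 7.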
The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x
and a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y
occurs.



z = 0;

for (i=0; i<=10; i++) {
    x[i] = i;
    z = z + y[i] + y[2*i];
}

a = y[k];

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y are used. The reentry to the forbody is also controlled by two
independent signals ("cycle1" and "cycle2"). For the read accesses, "cycle2" guarantees that the read y
accesses occur in the correct order. But the beginning of an iteration for read y and write x accesses is
not synchronized. Only at loop exit must all accesses be finished, which is guaranteed by signal "loop
finished". The single read access is completely independent of the loop.

Fig. 16 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a single write access
needs no additional control, and "cycle2" is removed since the inserted MERGE and DEMUX functions
automatically guarantee the correct execution order. The read y accesses are not independent anymore
since they all refer to the same RAM, and the functions have been merged. ESEQs have been allocated
to control the MERGE and DEMUX functions of the read sequence, and for the first-stage DEMUX
functions which separate the read OUT values for the read sequence and for the final single-read access.
The ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described in Section
4.2.7, Phase 2, to generate correct control events for the GATEs and DEMUX functions.